My first programming experiences on the Mahout project were pretty straightforward. I was working on a problem that seemed to need some clustering - grouping data that is "similar". I listened to a couple of excellent Google Tutorials that explained how to use Map/Reduce to implement Canopy Clustering. This was a small leap from my previous experiences and I was able to implement it in a few days. In the process, I learned about k-Means Clustering since canopy clusters are often used as input to k-means. Now, the math was starting to slow me down but the algorithm was still pretty simple and so I was able to make good progress.
Then came mean shift clustering. A student had posted an email expressing interest in the algorithm and in a later correspondence included a reference to a comprehensive paper on it. The math was scary, completely opaque to me, and filled with new statistical terms. I must have read that paper a dozen times before it dawned upon me what it was doing. It's not so much that I could not interpret the notation, I just have a hard time mapping it into the real world so that I can visualize its intent. One night I had a vision of hydrogen atoms floating in interstellar space, weakly attracted to each other and slowly forming into clumps - gas clusters - that ultimately became stars. Somehow, that vision morphed into vast clouds of canopy clusters moving and merging together and an implementation was born. I still cannot prove it is correct, but it worked on a test dataset and it felt right so I committed it.
Then, Ted Dunning, an active contributor to the Hadoop and Mahout mailing lists, introduced me to Dirichlet Process Clustering. Unlike the other clustering algorithms which assign each data point to a single, "best" cluster, Dirichlet allows each point to be assigned to multiple clusters - each with an associated probability. This is much more realistic, but it makes the math really, really complicated and I'm still struggling to map the notation onto reality. Each time I read all the papers it gets a little bit clearer, but I'm hoping for another vision. So far, the best I can do right now is a variation of the Chinese Restaurant Process.
Imagine a very large Chinese restaurant, with (infinitely) many tables. Each table can seat (infinitely) many patrons but only serves a single set of dishes to all of them. The first patron to sit at an empty table orders exactly what she likes from the menu for that table. When a new patron enters the restaurant, he surveys all of the tables. Each will have some items he likes and some that he does not. By comparing his likes and dislikes with the menu on each table, we can calculate the probability that he will sit at each table as well as the probability that he will choose an empty table. If the tables represent the clusters and the patrons represent the data points then some clusters will be more likely than others to contain the point. Of course, the probabilities must all add up to 1. Maybe I can get Ted to comment on this posting and my dumbed-down version of DPC.