Friday, September 17, 2010

Mahout committers Ted Dunning, Grant Ingersoll, and I met with some of our Mahout user friends over dinner at Panera's in Millbrae last night. The study of machine learning for me has always been a sequence of little mysteries to solve, and this evening proved to be no exception. Ted kicked off the conversation with a provocative statement: ML is really about different ways to extract [meaningful] models from large volumes of data, and classification, clustering, SVD (singular value decomposition), and recommendation are all really just different ways to skin the same cat. It seemed preposterous at first. He drew a box with lots of arrows going in on the left and just a few arrows coming out on the right to illustrate how each of these processes consumes volumes of data and produces a much smaller, more concise model of it. He went on to say that each of these techniques is better than its brethren at extracting certain kinds of meaning, and that real-world data will often require more than one of them to be chained together to gain accurate insight (more meaningful models).

We've been having some discussions on the dev@mahout.apache.org mailing list recently about how to unify our clustering and classification data structures in order to make them more "plug and play". I had done some refactoring of the clustering data structures to eliminate a lot of redundant code and unify their behaviors. Ted had introduced an AbstractVectorClassifier a couple of months ago as a way of unifying all the classification algorithms, and he was looking at one of its new subclasses, the VectorModelClassifier, in the clustering package. Where had it come from? After reviewing the code I recalled it as an experiment I'd done to see if I could integrate our new clustering models into the classification framework. I had not intended to commit it at the time, so I didn't recognize it at first, but there it was: a classifier that could classify vectors based upon the model output of any of our clustering jobs. The beginnings of integration were at hand.
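To make that concrete, here is roughly what such a classifier boils down to. This is a minimal Java sketch of the idea, not the actual VectorModelClassifier source; ClusterModel and its pdf() method are simplified, hypothetical stand-ins for Mahout's clustering interfaces, and I'm using plain double arrays in place of Mahout Vectors.

    import java.util.List;

    // Sketch only: score a new vector against the models a clustering
    // job produced. ClusterModel is a hypothetical stand-in.
    interface ClusterModel {
      double pdf(double[] x); // how well does this model explain x?
    }

    class ClusterModelClassifier {
      private final List<ClusterModel> models; // output of a clustering job

      ClusterModelClassifier(List<ClusterModel> models) {
        this.models = models;
      }

      // Score the vector against every cluster model and normalize, so
      // the result reads as "probability this vector belongs to each
      // cluster". The vector is classified to the best-scoring model(s).
      double[] classify(double[] x) {
        double[] scores = new double[models.size()];
        double total = 0.0;
        for (int i = 0; i < models.size(); i++) {
          scores[i] = models.get(i).pdf(x);
          total += scores[i];
        }
        for (int i = 0; i < scores.length; i++) {
          scores[i] /= total; // normalize so the scores sum to 1
        }
        return scores;
      }
    }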

All of our clustering jobs can perform a final job step which assigns each input vector to one or more of the models the clustering has produced. Said differently, they can all classify each input vector against those models. And when I think of the cluster-creation steps that our clustering algorithms all perform as training, the unification becomes even clearer. Of course, Ted pointed out, clustering is really just unsupervised classification and classification is really just supervised clustering. I think I'm starting to get it! Both consume large volumes of raw data and produce, supervised or not, a smaller set of models that characterize the data: its meaning.
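Seen that way, any clustering algorithm decomposes into the same two phases: a training pass that boils the raw data down to models, and a classification pass that assigns vectors to them. A toy sketch of the second phase, assuming the simplest possible case of nearest-centroid assignment (the names here are illustrative, not Mahout APIs):

    import java.util.List;

    // Sketch: the final assignment pass of a clustering job, viewed as
    // classification. The centroids are the "model" that training
    // (k-means iterations, Dirichlet sampling, ...) produced.
    class NearestCentroidExample {

      // Assign a vector to the model that fits it best, here the
      // nearest centroid by squared Euclidean distance.
      static int classify(double[] x, List<double[]> centroids) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.size(); i++) {
          double[] c = centroids.get(i);
          double d = 0.0;
          for (int j = 0; j < x.length; j++) {
            double diff = x[j] - c[j];
            d += diff * diff;
          }
          if (d < bestDist) {
            bestDist = d;
            best = i;
          }
        }
        return best; // the cluster this vector is "classified" to
      }
    }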

So what about SVD? Our SVD implementation uses Lanczos' algorithm to produce a set of eigenvectors and their associated eigenvalues from an input matrix. The eigenvectors and eigenvalues are typically much smaller than the original data and may be used in place of it for many computations. Hey, they're models too! The clustering of text documents, for example, typically involves a very high-dimensionality, sparse term vector for each document in a corpus. If one tries to cluster these raw vectors one often confronts "the curse of dimensionality" and the clustering does not produce useful results. If, instead, one uses SVD to first reduce the dimensionality of the term vectors and then clusters the reduced data, the results are often considerably improved. To summarize, SVD is a process which extracts a [meaningful] set of models (the eigenvectors and eigenvalues) from the data. Because it is unsupervised, might one think of it as a form of clustering? I don't know. At least it is one of the Mahout services that can be chained together with clustering to produce more insightful results.
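The chaining itself is simple in principle: project each raw term vector onto the top-k eigenvectors, then hand the small projected vectors to the clusterer. A rough sketch of the projection step, assuming the eigenvectors are available as dense rows (as the Lanczos job would produce them); this code is illustrative, not the actual Mahout pipeline, and real term vectors would of course be sparse:

    // Sketch: reduce a document's term vector from n dimensions to k
    // by taking its dot product with each of the top-k eigenvectors.
    class SvdProjection {

      // eigenvectors: k rows, each of the original dimensionality n.
      // termVector: one document's raw term vector (length n).
      // Returns a dense length-k vector suitable for clustering.
      static double[] project(double[][] eigenvectors, double[] termVector) {
        int k = eigenvectors.length;
        double[] reduced = new double[k];
        for (int i = 0; i < k; i++) {
          double dot = 0.0;
          for (int j = 0; j < termVector.length; j++) {
            dot += eigenvectors[i][j] * termVector[j];
          }
          reduced[i] = dot; // coordinate along eigenvector i
        }
        return reduced;
      }
    }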

Matrices are also used heavily by our recommender services to recommend items to users based upon some metric of user preference for each item. These co-occurrence matrices are generally large and unwieldy. In user-based recommending, the goal is to recommend items to users based upon what items similar users found most interesting, and the co-occurrence matrix has size equal to the number of users squared; often a huge matrix. In item-based recommending, the goal is to recommend based upon which items are similar to each other, and the co-occurrence matrix has size equal to the number of items squared; usually smaller, but still quite large. SVD can be used in both cases to reduce the dimensionality of the co-occurrence matrices. And clustering services, too, can be used within a recommender engine to codify the similarity metrics used to make the recommendations. These services really do need to plug and play together.
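For the item-based case, a toy example shows where the items-squared size comes from. This assumes a simple boolean user-by-item preference matrix; Mahout computes the real thing at scale with MapReduce, so this is just the arithmetic, not the implementation:

    // Sketch: build an item-item co-occurrence matrix by counting how
    // often each pair of items was preferred by the same user. The
    // result is numItems x numItems, regardless of how many users
    // there are -- hence "number of items squared".
    class CooccurrenceExample {

      // prefs[u][i] is true if user u expressed a preference for item i.
      static int[][] itemCooccurrence(boolean[][] prefs) {
        int numItems = prefs[0].length;
        int[][] cooc = new int[numItems][numItems];
        for (boolean[] userRow : prefs) {
          for (int i = 0; i < numItems; i++) {
            if (!userRow[i]) continue;
            for (int j = 0; j < numItems; j++) {
              if (userRow[j]) {
                cooc[i][j]++; // items i and j co-occur for this user
              }
            }
          }
        }
        return cooc;
      }
    }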

OK, I'm having a bit of an epiphany here, and this may not all be spot on. But the proposition that the parts of Mahout I've always viewed as unrelated are actually interdependent is starting to grow on me. It's kind of a grand unification theory, one which may well lead to further integration and other improvements in the Mahout service portfolio as it plays out. A few mysteries got solved last night and a few more got added to the list. An evening well spent.