Friday, September 17, 2010

Mahout committers Ted Dunning, Grant Ingersoll and I met with some of our Mahout user friends over dinner at Panera's in Millbrae last night. The study of Machine Learning for me has always been a sequence of little mysteries to solve and this evening proved to be no exception. Ted kicked off the conversation with a provocative statement that ML is really about different ways to extract [meaningful] models from large volumes of data and that classification, clustering, SVD (singular value decomposition) and recommendation are all really just different ways to skin the same cat. It seemed preposterous at first. He drew a box with lots of arrows going in on the left and just a few arrows coming out on the right to illustrate how each of these processes consumes volumes of data and produces a much smaller and more concise model of it. He went on to say that each of these techniques is better than its brethren at extracting certain kinds of meaning and that real-world data will often require more than one of these techniques to be chained together to gain accurate insight (more meaningful models).

We've been having some discussions on the dev@mahout.apache.org mailing list recently about how to unify our clustering and classification data structures in order to make them more "plug and play". I had done some refactoring of the clustering data structures in order to eliminate a lot of redundant code and unify their behaviors. Ted had introduced an AbstractVectorClassifier a couple of months ago as a way of unifying all the classification algorithms and was looking at one of its new subclasses, the VectorModelClassifier, in the clustering package. Where had it come from? After reviewing the code I recalled it as an experiment I'd done to see if I could integrate our new clustering models into the classification framework. I had not intended to commit it at the time and so I didn't recognize it at first, but there it was: a classifier that could classify vectors based upon the model output of any of our clustering jobs. The beginnings of integration were at hand.

All of our clustering jobs can perform a final job step which assigns each input vector to one or more of the models which the clustering has produced. Said differently, they can all classify each input vector to one or more of the models. And when I think about the cluster-creation steps that our clustering algorithms all perform as training, the unification becomes even clearer. Of course, Ted pointed out, clustering is really just unsupervised classification and classification is really just supervised clustering. I think I'm starting to get it! Both consume large volumes of raw data and produce, either supervised or not, a smaller set of models that characterize the data: its meaning.
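To make the "clustering is training, assignment is classification" idea concrete, here is a minimal sketch in plain Python. It is not Mahout code: the function names (train_kmeans, classify) and the toy data are my own assumptions, and k-means with Euclidean distance simply stands in for any of the clustering jobs.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(vector, centroids):
    """Classification step: index of the nearest cluster model (centroid)."""
    return min(range(len(centroids)), key=lambda i: euclidean(vector, centroids[i]))

def train_kmeans(points, initial_centroids, iterations=10):
    """Unsupervised 'training': refine centroids from unlabeled points.

    Hypothetical helper for illustration only; not a Mahout API.
    """
    centroids = [list(c) for c in initial_centroids]
    for _ in range(iterations):
        # Group the points by their currently nearest centroid.
        members = [[] for _ in centroids]
        for p in points:
            members[classify(p, centroids)].append(p)
        # Move each centroid to the mean of its members (keep it if empty).
        for i, group in enumerate(members):
            if group:
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids

points = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]]
centroids = train_kmeans(points, initial_centroids=[[0.0, 0.0], [10.0, 10.0]])

# The trained cluster models now act as a classifier for unseen vectors.
labels = [classify(v, centroids) for v in [[2.0, 1.0], [7.0, 9.0]]]
```

The same classify function is used both inside training and afterwards on new vectors, which is the unification in miniature: the centroids are the model, whether or not anyone supplied labels to build them.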

So what about SVD? Our SVD implementation uses Lanczos' algorithm to produce a set of eigenvectors and their associated eigenvalues from an input matrix. The eigenvectors and eigenvalues are typically much smaller than the original data and may be used in place of it for many computations. Hey, they're models too! The clustering of text documents, for example, typically involves a very high-dimensional, sparse term vector for each document in a corpus. If one tries to cluster these raw vectors, one often confronts "the curse of dimensionality" and the clustering does not produce useful results. If, instead, one uses SVD to first reduce the dimensionality of the term vectors and then clusters that data, the results are often considerably improved. To summarize, SVD is a process which extracts a [meaningful] set of models (the eigenvectors and eigenvalues) from the data. Because it is unsupervised, might one think of it as a form of clustering? IDK. At least it is one of the Mahout services that can be chained together with clustering to produce more insightful results.
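Here is a rough sketch of the "reduce first, then cluster" step. To keep it short I use plain power iteration rather than Lanczos, and I keep only the single dominant direction; the function names and the tiny term matrix are assumptions made up for illustration, not anything from Mahout.

```python
import math

def top_right_singular_vector(rows, iterations=100):
    """Approximate the dominant right singular vector of a matrix.

    Uses power iteration on A^T A (a simpler stand-in for Lanczos).
    `rows` is a list of equal-length document term vectors.
    """
    n = len(rows[0])
    v = [1.0 / math.sqrt(n)] * n  # arbitrary unit-length starting vector
    for _ in range(iterations):
        # Compute w = A^T (A v) without forming A^T A explicitly.
        av = [sum(r[j] * v[j] for j in range(n)) for r in rows]
        w = [sum(rows[i][j] * av[i] for i in range(len(rows))) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(rows, v):
    """Reduce each term vector to one coordinate along direction v."""
    return [sum(r[j] * v[j] for j in range(len(v))) for r in rows]

# Three toy documents over four terms; the first two share a vocabulary.
docs = [
    [3.0, 3.0, 0.0, 0.0],
    [3.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
v = top_right_singular_vector(docs)
reduced = project(docs, v)  # one number per document instead of four
```

A real pipeline would keep many more than one direction, but the shape of the idea is the same: the reduced vectors, not the raw term vectors, are what get handed to the clustering job.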

Matrices are also used a lot by our recommender services to recommend items to users based upon some metrics of user preference for each item. These co-occurrence matrices are generally large and unwieldy. In user-based recommending, the goal is to recommend items to users based upon what items similar users found most interesting, and the co-occurrence matrix has size equal to the number of users squared, often a huge matrix. In item-based recommending, the goal is to recommend based upon which items are similar to each other, and the co-occurrence matrix has size equal to the number of items squared, usually smaller but still quite large. SVD can be used in both cases to reduce the dimensionality of the co-occurrence matrices. And so too can clustering services be used within a recommender engine to codify the similarity metrics used to make the recommendations. These services really do need to plug and play together.
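For the item-based case, a toy version of the co-occurrence idea might look like the sketch below. The user names, item labels, and both function names are invented for illustration, and the scoring (summing raw co-occurrence counts) is just one simple choice, not a description of how Mahout's recommenders actually score.

```python
from collections import defaultdict
from itertools import permutations

def cooccurrence(user_items):
    """Count, for each ordered item pair, how many users have both items.

    Stored sparsely as a dict; the dense equivalent would be items x items.
    """
    counts = defaultdict(int)
    for items in user_items.values():
        for a, b in permutations(sorted(items), 2):
            counts[(a, b)] += 1
    return counts

def recommend(user, user_items, counts):
    """Rank unseen items by total co-occurrence with the user's items."""
    seen = user_items[user]
    scores = defaultdict(int)
    for (a, b), c in counts.items():
        if a in seen and b not in seen:
            scores[b] += c
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))

prefs = {
    "alice": {"A", "B"},
    "bob": {"A", "B", "C"},
    "carol": {"B", "C", "D"},
}
counts = cooccurrence(prefs)
suggestions = recommend("alice", prefs, counts)
```

Even in this tiny example the pair table grows with the square of the number of items, which is why reducing it with SVD, or summarizing similar items with clustering, becomes attractive at real scale.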

Ok, I'm having a bit of an epiphany here and this may not all be spot on. But the proposition that the parts of Mahout which I've always viewed as being unrelated are actually interdependent is starting to grow on me. It's kind of a grand unification theory which may well lead to further integration and other improvements in the Mahout service portfolio as it plays out. A few mysteries got solved last night and a few more got added to the list. An evening well spent.

23 comments:

Federico Castanedo said...

Great Explanation!
