Friday, September 17, 2010

Mahout committers Ted Dunning, Grant Ingersoll and I met with some of our Mahout user friends over dinner at Panera's in Millbrae last night. The study of Machine Learning for me has always been a sequence of little mysteries to solve, and this evening proved to be no exception. Ted kicked off the conversation with a provocative statement: ML is really about different ways to extract [meaningful] models from large volumes of data, and classification, clustering, SVD (singular value decomposition) and recommendation are all really just different ways to skin the same cat. It seemed preposterous at first. He drew a box with lots of arrows going in on the left and just a few arrows coming out on the right to illustrate how each of these processes consumes volumes of data and produces a much smaller and more concise model of it. He went on to say that each of these techniques is better than its brethren at extracting certain kinds of meaning, and that real-world data will often require more than one of these techniques to be chained together to gain accurate insight (more meaningful models).

We've been having some discussions on the dev@mahout.apache.org mailing list recently about how to unify our clustering and classification data structures in order to make them more "plug and play". I had done some refactoring of the clustering data structures in order to eliminate a lot of redundant code and unify their behaviors. Ted had introduced an AbstractVectorClassifier a couple of months ago as a way of unifying all the classification algorithms and was looking at one of its new subclasses, the VectorModelClassifier, in the clustering package. Where had it come from? After reviewing the code I recalled it as an experiment I'd done to see if I could integrate our new clustering models into the classification framework. I had not intended to commit it at the time, and so I didn't recognize it at first, but there it was: a classifier that could classify vectors based upon the model output of any of our clustering jobs. The beginnings of integration were at hand.

All of our clustering jobs can perform a final job step which assigns each input vector to one or more of the models which the clustering has produced. Said differently, they can all classify each input vector to one or more of the models. And when I think about the cluster-creation steps that our clustering algorithms all perform as training, the unification becomes even clearer. Of course, Ted pointed out, clustering is really just unsupervised classification and classification is really just supervised clustering. I think I'm starting to get it! Both consume large volumes of raw data and produce, either supervised or not, a smaller set of models that characterize the data: its meaning.
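The train-then-classify view of clustering can be sketched in a few lines. This is a minimal illustration in plain Python, not Mahout code: the function names (`train_kmeans`, `nearest`), the toy data and the choice of k-means with Euclidean distance are all assumptions made for the example.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(centroids, vector):
    """Classify: index of the cluster model closest to the vector."""
    return min(range(len(centroids)), key=lambda i: distance(centroids[i], vector))

def train_kmeans(vectors, centroids, iterations=10):
    """Train: refine the initial centroids with plain k-means iterations."""
    centroids = [list(c) for c in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in vectors:
            groups[nearest(centroids, v)].append(v)
        for i, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

# Hypothetical toy data: two obvious groups, each seeding one centroid.
data = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2], [8.0, 8.0], [8.2, 7.8], [7.8, 8.2]]
models = train_kmeans(data, centroids=[data[0], data[3]])

# The trained cluster models now act as a classifier for unseen vectors.
labels = [nearest(models, v) for v in [[0.9, 1.1], [7.5, 8.5]]]
print(labels)
```

The unsupervised step produces the models; the final assignment step is ordinary classification against them, which is the point Ted was making.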

So what about SVD? Our SVD implementation uses Lanczos' algorithm to produce a set of eigenvectors and their associated eigenvalues from an input matrix. The eigenvectors and eigenvalues are typically much smaller than the original data and may be used in place of it for many computations. Hey, they're models too! The clustering of text documents, for example, typically involves a very high-dimensional, sparse term vector for each document in a corpus. If one tries to cluster these raw vectors, one often confronts "the curse of dimensionality" and the clustering does not produce useful results. If, instead, one uses SVD to first reduce the dimensionality of the term vectors and then clusters the reduced data, the results are often considerably improved. To summarize, SVD is a process which extracts a [meaningful] set of models (the eigenvectors and eigenvalues) from the data. Because it is unsupervised, might one think of it as a form of clustering? IDK. At least it is one of the Mahout services that can be chained together with clustering to produce more insightful results.
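To make the "reduce first, then cluster" idea concrete, here is a small sketch in plain Python. It is not Mahout's Lanczos solver: as a deliberately simpler stand-in it uses power iteration with deflation on the term-by-term matrix A^T A, whose eigenvectors are the right singular vectors of the document-term matrix A and whose eigenvalues are the squared singular values. The toy matrix and all function names are assumptions for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mat_vec(m, v):
    return [dot(row, v) for row in m]

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def top_eigenpairs(sym, k, iterations=200):
    """Leading k eigenpairs of a symmetric positive semidefinite matrix,
    found by power iteration followed by deflation."""
    sym = [row[:] for row in sym]  # work on a copy
    n = len(sym)
    pairs = []
    for _pair in range(k):
        v = normalize([1.0] * n)
        for _step in range(iterations):
            v = normalize(mat_vec(sym, v))
        value = dot(v, mat_vec(sym, v))  # Rayleigh quotient
        pairs.append((value, v))
        # Deflate: remove the component just found so the next pass
        # converges to the next-largest eigenpair.
        for i in range(n):
            for j in range(n):
                sym[i][j] -= value * v[i] * v[j]
    return pairs

# Hypothetical document-term counts: rows are documents, columns are terms.
docs = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 2, 1, 1],
]

columns = list(zip(*docs))
gram = [[dot(ci, cj) for cj in columns] for ci in columns]  # A^T A

pairs = top_eigenpairs(gram, k=2)
basis = [vector for _value, vector in pairs]

# Project each 6-dimensional term vector onto the 2 leading directions.
reduced = [[dot(doc, vector) for vector in basis] for doc in docs]
for row in reduced:
    print([round(x, 3) for x in row])
```

The two-column `reduced` matrix is what would then be handed to a clustering job in place of the raw term vectors. On real corpora the number of retained dimensions is a tuning choice, and an iterative solver such as Lanczos is used because forming A^T A explicitly is too expensive.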

Matrices are also used a lot by our recommender services to recommend items to users based upon some metrics of user preference for each item. These co-occurrence matrices are generally large and unwieldy. In user-based recommending, the goal is to recommend items to users based upon what items similar users found most interesting, and the co-occurrence matrix has size equal to the number of users squared; often a huge matrix. In item-based recommending, the goal is to recommend based upon which items are similar to each other, and the co-occurrence matrix has size equal to the number of items squared; usually smaller but still quite large. SVD can be used in both cases to reduce the dimensionality of the co-occurrence matrices. And so too can clustering services be used within a recommender engine to codify the similarity metrics used to make the recommendations. These services really do need to plug and play together.
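A toy version of the item-based case shows where the items-squared matrix comes from. This is a sketch in plain Python under simple assumptions (boolean preferences, raw co-occurrence counts as the similarity metric); the user names, item names and functions are hypothetical and do not reflect Mahout's recommender API.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(preferences):
    """Count how often each ordered pair of items is liked by the same user.
    The result is a sparse, symmetric items-by-items matrix."""
    counts = defaultdict(int)
    for items in preferences.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def recommend(preferences, counts, user, top_n=2):
    """Score unseen items by their co-occurrence with the user's items."""
    owned = set(preferences[user])
    scores = defaultdict(int)
    for (a, b), count in counts.items():
        if a in owned and b not in owned:
            scores[b] += count
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]

# Hypothetical preference data.
prefs = {
    "alice": ["hadoop", "mahout", "lucene"],
    "bob": ["hadoop", "mahout"],
    "carol": ["mahout", "lucene", "solr"],
    "dave": ["hadoop"],
}

counts = cooccurrence(prefs)
picks = recommend(prefs, counts, "dave")
print(picks)
```

Even in this sparse dictionary form the number of possible entries grows with the square of the item count, which is why reducing the matrix with SVD, or grouping similar items by clustering, becomes attractive at scale.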

Ok, I'm having a bit of an epiphany here and this may not all be spot on. But the proposition that the parts of Mahout which I've always viewed as being unrelated are actually interdependent is starting to grow on me. It's kind of a grand unification theory which may well lead to further integration and other improvements in the Mahout service portfolio as it plays out. A few mysteries got solved last night and a few more got added to the list. An evening well spent.
