Wednesday, March 26, 2008

What is Mahout?

Around the end of January I saw an interesting post on the Hadoop users list announcing the creation of a new sub-project called Mahout under the Apache Lucene project. I decided this would be a good place to continue my Hadoop odyssey.

Using cloud computing technologies such as EC2, Lucene, Nutch, Hadoop, Pig and HBase, it is now possible for even small companies to perform analytics over the entire World Wide Web. The emerging challenge is to develop improved analytics that can separate relevant information from spam, learn from previous experience and organize information in ever more meaningful ways.

In recent years a rather large community of researchers has addressed the problem of extracting useful intelligence from the Web. Whether it is classifying documents into categories, clustering them to form groups that make sense to users, or ranking them by relevancy given some query, these methods fall under the broad category of machine learning algorithms. Unfortunately, most of the available algorithms are either proprietary, under restrictive licenses, or do not scale to massive amounts of information.

The focus of the Mahout project is to develop commercially friendly, scalable machine learning algorithms such as classification, clustering, regression and dimension reduction under the Apache brand and on top of Hadoop. Its initial focus is to build out the ten machine learning libraries detailed in “Map-Reduce for Machine Learning on Multicore” by Chu, Kim, Liu, Yu, Bradski, Ng & Olukotun of Stanford University. Though the project is only in its second month, we have an active and growing community with initial submissions in the areas of clustering, classification and matrix operations.

The Mahout team chose this name for the project out of admiration and respect for the work of the Hadoop project, whose logo is an elephant. According to Wikipedia, “A mahout is a person who drives an elephant”. It goes on to say that the “Sanskrit language distinguishes three types [of mahouts]: Reghawan, who use love to control their elephants, Yukthiman, who use ingenuity to outsmart them and Balwan, who control elephants with cruelty”. We intend to practice only in the first two categories and welcome individuals with similar values who would like to contribute to the project.

1 comment:

pravesh said...

Nice article. Lucene along with all sub-projects like SOLR, Hadoop, Mahout gives a lot of opportunities to open source developers to explore in various areas like search engineering, Map-Reduce, and now machine learning too :)