Thursday, March 27, 2008

The First Hadoop Summit

On March 25th, I attended the first Hadoop Summit. When I got to the conference, I picked up my t-shirt and introduced myself to Ajay Anand, the Hadoop product manager and conference organizer. What had started out as a small, local workshop in the minds of the organizers had mushroomed into an overnight sensation. The original venue had space for perhaps a hundred participants and was booked full within a day of the registration. After finding a bigger room at Yahoo! which was also immediately filled, they partnered with Amazon Web Services to move the venue to the Network Meeting Center in Santa Clara, CA. By the time I arrived, that venue was filled to standing room only. I went into the auditorium and found a seat next to a gentleman who is head of Emerging Technology of a Korean company. He told me he has a 200 node cluster and is interested in new marketing applications that are now possible using this technology. There are lots of similar business opportunities awaiting leading edge adopters of Hadoop.

Ajay opened the conference and introduced Doug Cutting and Eric Baldeshweieler who gave a historical overview of the Hadoop evolution up to where it is today in production at Y! Hadoop began its life as a part of the Apache Lucene Nutch project, which needed a distributed file system to store the web pages returned by its crawlers. They were aware of the work being done at Google and wanted to exploit the Map/Reduce paradigm to run computations over these very large data sets. The project snowballed with the support of an active, worldwide, open source community abetted by Yahoo! investments and has recently become a top level Apache project of its own right.

Five different speakers followed this introduction that each described work being done on top of the Hadoop platform.

  • Chris Olston (Y!) gave a nice introduction to Pig, which I have explored a bit and have found to be quite powerful. "Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets."
  • Chris, Kevin Beyer (IBM) gave a talk on JAQL which is a new, more SQL-like, query language for processing JSON data on top of Hadoop.
  • Michael Isard (Microsoft Research) described DryadLINQ, a highly parallel environment for developing computations on a cloud computing infrastructure. He showed that map/reduce computations can be phrased quite simply using their language. The reaction of several people I spoke with was, unfortunately, "too bad it is buried inside of Microsoft's platform".
  • Andy Konsinski (UC Berkeley) talked about the X-trace monitoring framework they had embedded inside of Hadoop adding only about 500 lines of code. This seems to be potentially useful in understanding the actual behavior of M/R jobs and they promise to clean it up and submit it as a patch.
  • Ben Reed (Y!) discussed Zookeeper, a hierarchical namespace directory service that can be used for coordinating and communicating between multiple user jobs on Hadoop.
After lunch Michael Stack (Powerset) gave an introduction to HBase, a scalable, robust, column-oriented database that is build upon the Hadoop distributed file system. The project is in its second year and is based upon BigTable, another Google technology. It stores very large tables which can be accessed by row primary key, column name and a timestamp. I've not yet experimented with HBase, but will likely need to utilize it in my Mahout work for storing and manipulating very large vectors and matrices. Afterwards, Brian Duxbury (Rapleaf) described how HBase and Hadoop are used to search the Web for information about people's reputations that can be gleaned from various online sources. How can I influence that score?

There were several additional talks that addressed application level work being done on top of the Hadoop platform:

  • Jinesh Varia (Amazon) talked about how they deploy GrepTheWeb jobs on Hadoop clusters that are materialized on EC2 to run and then vanish when they are finished. This is an example of the kind of technology that is now available to anybody with a wallet and a good map/reduce algorithm that they want to use for generating business value.
  • Steve Schlosser (Intel) and David O’Hallaron (CMU) talked about building ground models of Southern California, using Hadoop in a novel processing application of seismic data.
  • Mike Haley (Autodesk) talked about how they are using classification algorithms and Hadoop to correlate the product catalogs of building parts suppliers into their graphical component library that is used for CAD.
  • Christian Kunz (Y!) described their recently-announced production use of Hadoop. He showed some very big numbers and impressive improvements over their previous technology in terms of scale, reliability, manageability and speed. To generate their web search index, they routinely run 100k map and 10k reduce jobs over hundreds of terabytes of web data using a cluster with 10k cores and 20 petabytes of disk space. This illustrates what is now possible to do in production settings with Hadoop.
  • Jimmy Lin (University of MD) and Christophe Bisciglia (Google) talked about natural language processing work going on at UMD and other universities. I got a chance to shake Christophe’s hand during the happy hour and to thank him for revolutionizing my life (see My Hadoop Odyssey, below).

I had a fantastic opportunity to sit on the futures panel with leaders of the Hadoop community (Sameer Paranjpye, Sanjay Radia, Owen O’Malley (all Y!) and Chad Walters (Powerset)) to introduce the new Mahout project while they presented the future directions of Hadoop and Hbase. The panel gave me an outstanding soapbox, generated a lot of interest in machine learning applications and several great opportunities for followup discussions with people from the greater Hadoop community.


Anonymous said...

MapReduce is the wave of the future...not all analytic database vendors think it's a "major step backward".

Today, Aster Data announced the world's first In-Database MapReduce solution, combining the power of MapReduce with the rich functionality of a SQL MPP database.

Sai Santosh said...

There will be a lot of difference in attending hadoop online training center compared to attending a live classroom training. However, websites like this with rich in information will be very useful for gaining additional knowledge.