Ajay opened the conference and introduced Doug Cutting and Eric Baldeshweieler who gave a historical overview of the Hadoop evolution up to where it is today in production at Y! Hadoop began its life as a part of the Apache Lucene Nutch project, which needed a distributed file system to store the web pages returned by its crawlers. They were aware of the work being done at Google and wanted to exploit the Map/Reduce paradigm to run computations over these very large data sets. The project snowballed with the support of an active, worldwide, open source community abetted by Yahoo! investments and has recently become a top level Apache project of its own right.
Five different speakers followed this introduction that each described work being done on top of the Hadoop platform.
- Chris Olston (Y!) gave a nice introduction to Pig, which I have explored a bit and have found to be quite powerful. "Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets."
- Chris, Kevin Beyer (IBM) gave a talk on JAQL which is a new, more SQL-like, query language for processing JSON data on top of Hadoop.
- Michael Isard (Microsoft Research) described DryadLINQ, a highly parallel environment for developing computations on a cloud computing infrastructure. He showed that map/reduce computations can be phrased quite simply using their language. The reaction of several people I spoke with was, unfortunately, "too bad it is buried inside of Microsoft's platform".
- Andy Konsinski (UC Berkeley) talked about the X-trace monitoring framework they had embedded inside of Hadoop adding only about 500 lines of code. This seems to be potentially useful in understanding the actual behavior of M/R jobs and they promise to clean it up and submit it as a patch.
- Ben Reed (Y!) discussed Zookeeper, a hierarchical namespace directory service that can be used for coordinating and communicating between multiple user jobs on Hadoop.
There were several additional talks that addressed application level work being done on top of the Hadoop platform:
- Jinesh Varia (Amazon) talked about how they deploy GrepTheWeb jobs on Hadoop clusters that are materialized on EC2 to run and then vanish when they are finished. This is an example of the kind of technology that is now available to anybody with a wallet and a good map/reduce algorithm that they want to use for generating business value.
- Steve Schlosser (Intel) and David O’Hallaron (CMU) talked about building ground models of Southern California, using Hadoop in a novel processing application of seismic data.
- Mike Haley (Autodesk) talked about how they are using classification algorithms and Hadoop to correlate the product catalogs of building parts suppliers into their graphical component library that is used for CAD.
- Christian Kunz (Y!) described their recently-announced production use of Hadoop. He showed some very big numbers and impressive improvements over their previous technology in terms of scale, reliability, manageability and speed. To generate their web search index, they routinely run 100k map and 10k reduce jobs over hundreds of terabytes of web data using a cluster with 10k cores and 20 petabytes of disk space. This illustrates what is now possible to do in production settings with Hadoop.
- Jimmy Lin (University of MD) and Christophe Bisciglia (Google) talked about natural language processing work going on at UMD and other universities. I got a chance to shake Christophe’s hand during the happy hour and to thank him for revolutionizing my life (see My Hadoop Odyssey, below).
I had a fantastic opportunity to sit on the futures panel with leaders of the Hadoop community (Sameer Paranjpye, Sanjay Radia, Owen O’Malley (all Y!) and Chad Walters (Powerset)) to introduce the new Mahout project while they presented the future directions of Hadoop and Hbase. The panel gave me an outstanding soapbox, generated a lot of interest in machine learning applications and several great opportunities for followup discussions with people from the greater Hadoop community.