Saturday, March 29, 2008

How Big is a Petabyte, Anyway?

I woke up a bit early this morning wondering how to describe a petabyte. I can easily count to a hundred and do math in that range. In the thousands and above, I resort to scientific notation. A petabyte is 1,000,000,000,000,000 bytes, or 10^15 bytes. How do I convey that incredible size in a comprehensible way?

Well, most of us nowadays have a gigabyte of memory in our laptops. A thousand such laptops hold a terabyte, and a million of them hold a petabyte. What if you could make a million laptops all work together on the same problem? What could they do? What would you want them to do?

There are about two hundred billion stars in our Milky Way Galaxy, or 2x10^11 stars. Five thousand Milky Way galaxies would contain 10^15 stars, a petastar.

If you started typing and typed out a petabyte of data, it would appear on the screen as a very long string. How long would that string be? Well, if you typed 5 characters per centimeter, then 10^15 characters would stretch 2x10^14 centimeters, two billion kilometers.

OK, that's still pretty incomprehensible. Let's see how long the beam of a flashlight would take to travel from one end of the string to the other. Light travels at 3x10^10 cm/sec, so it would need about 6,667 seconds to traverse the petabyte string, almost two hours.

This is still hard to grasp. Consider instead getting on a jet and flying at 1,000 km/hr along the string. The trip would take 2x10^6 hours, about 230 years. Better fly first class.
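
If you want to check the arithmetic, here is a small back-of-the-envelope sketch in Java. The constants are the same rough figures used above, so the outputs are only as precise as those assumptions:

    public class PetabyteScale {
      public static void main(String[] args) {
        double petabyte = 1e15;                    // characters (bytes) in a petabyte
        double charsPerCm = 5.0;                   // typing density assumed above

        double lengthCm = petabyte / charsPerCm;   // 2e14 cm
        double lengthKm = lengthCm / 1e5;          // 2e9 km

        double lightCmPerSec = 3e10;               // speed of light, rounded
        double lightSeconds = lengthCm / lightCmPerSec;

        double jetKmPerHr = 1000.0;
        double jetHours = lengthKm / jetKmPerHr;

        System.out.printf("String length: %.1e km%n", lengthKm);
        System.out.printf("Flashlight beam: %.0f seconds (%.1f hours)%n",
            lightSeconds, lightSeconds / 3600.0);
        System.out.printf("Jet at 1000 km/hr: %.1e hours (%.0f years)%n",
            jetHours, jetHours / (24 * 365.25));
      }
    }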

Of course, you could not type a petabyte yourself in your lifetime, nor could you and all of your friends. But the Web is perhaps a tenth of a petabyte or so right now, and it is still growing really fast. Lots of people are typing at the same time, and computers are helping them. With Hadoop on a good-sized cloud, you can run analytics over that dataset in a reasonable time. What kind of questions do you want to ask?

Thursday, March 27, 2008

The First Hadoop Summit

On March 25th, I attended the first Hadoop Summit. When I got to the conference, I picked up my t-shirt and introduced myself to Ajay Anand, the Hadoop product manager and conference organizer. What had started out in the organizers' minds as a small, local workshop had mushroomed into an overnight sensation. The original venue had space for perhaps a hundred participants and was booked full within a day of registration opening. After finding a bigger room at Yahoo!, which also filled immediately, they partnered with Amazon Web Services to move the venue to the Network Meeting Center in Santa Clara, CA. By the time I arrived, that venue was filled to standing room only. I went into the auditorium and found a seat next to a gentleman who heads Emerging Technology at a Korean company. He told me he has a 200-node cluster and is interested in the new marketing applications this technology now makes possible. There are lots of similar business opportunities awaiting leading-edge adopters of Hadoop.

Ajay opened the conference and introduced Doug Cutting and Eric Baldeschwieler, who gave a historical overview of Hadoop's evolution up to its production use at Y! today. Hadoop began its life as part of Nutch, an Apache Lucene subproject that needed a distributed file system to store the web pages returned by its crawlers. The Nutch developers were aware of the work being done at Google and wanted to exploit the Map/Reduce paradigm to run computations over these very large data sets. The project snowballed with the support of an active, worldwide, open source community, abetted by Yahoo! investments, and has recently become a top-level Apache project in its own right.

Five speakers followed this introduction, each describing work being done on top of the Hadoop platform:

  • Chris Olston (Y!) gave a nice introduction to Pig, which I have explored a bit and have found to be quite powerful. "Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets."
  • Kevin Beyer (IBM) gave a talk on JAQL, a new, more SQL-like query language for processing JSON data on top of Hadoop.
  • Michael Isard (Microsoft Research) described DryadLINQ, a highly parallel environment for developing computations on a cloud computing infrastructure. He showed that map/reduce computations can be phrased quite simply using their language. The reaction of several people I spoke with was, unfortunately, "too bad it is buried inside of Microsoft's platform".
  • Andy Konwinski (UC Berkeley) talked about the X-Trace monitoring framework they had embedded inside of Hadoop, adding only about 500 lines of code. It seems potentially useful for understanding the actual behavior of M/R jobs, and they promise to clean it up and submit it as a patch.
  • Ben Reed (Y!) discussed ZooKeeper, a hierarchical namespace directory service that can be used for coordination and communication among multiple user jobs on Hadoop.
After lunch, Michael Stack (Powerset) gave an introduction to HBase, a scalable, robust, column-oriented database that is built upon the Hadoop distributed file system. The project is in its second year and is based upon BigTable, another Google technology. It stores very large tables whose cells are addressed by row primary key, column name and a timestamp. I've not yet experimented with HBase, but will likely need to utilize it in my Mahout work for storing and manipulating very large vectors and matrices. Afterwards, Brian Duxbury (Rapleaf) described how HBase and Hadoop are used to search the Web for information about people's reputations, gleaned from various online sources. How can I influence that score?
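
Just to make that data model concrete for myself, here is a toy, in-memory sketch in Java of the layout Michael described: row key, then column name, then timestamp, with the newest version first. It is purely illustrative and is not the HBase client API:

    import java.util.Comparator;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Toy model of the BigTable/HBase layout: row key -> column -> timestamp -> value.
    public class ToyBigTable {
      // Timestamps sorted descending so the newest version comes first.
      private final NavigableMap<String,
          NavigableMap<String, NavigableMap<Long, String>>> rows = new TreeMap<>();

      public void put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, r -> new TreeMap<>())
            .computeIfAbsent(column,
                c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(timestamp, value);
      }

      // Fetch the most recent version of a cell, or null if absent.
      public String get(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> cols = rows.get(rowKey);
        if (cols == null) return null;
        NavigableMap<Long, String> versions = cols.get(column);
        return (versions == null || versions.isEmpty())
            ? null : versions.firstEntry().getValue();
      }

      public static void main(String[] args) {
        ToyBigTable t = new ToyBigTable();
        t.put("com.example.www", "contents:html", 1L, "<html>v1</html>");
        t.put("com.example.www", "contents:html", 2L, "<html>v2</html>");
        System.out.println(t.get("com.example.www", "contents:html")); // newest version
      }
    }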

There were several additional talks that addressed application-level work being done on top of the Hadoop platform:

  • Jinesh Varia (Amazon) talked about how they deploy GrepTheWeb jobs on Hadoop clusters that are materialized on EC2 to run and then vanish when they are finished. This is an example of the kind of technology that is now available to anybody with a wallet and a good map/reduce algorithm that they want to use for generating business value.
  • Steve Schlosser (Intel) and David O’Hallaron (CMU) talked about building ground models of Southern California, using Hadoop in a novel processing application of seismic data.
  • Mike Haley (Autodesk) talked about how they are using classification algorithms and Hadoop to correlate the product catalogs of building parts suppliers into their graphical component library that is used for CAD.
  • Christian Kunz (Y!) described their recently-announced production use of Hadoop. He showed some very big numbers and impressive improvements over their previous technology in terms of scale, reliability, manageability and speed. To generate their web search index, they routinely run 100k map and 10k reduce jobs over hundreds of terabytes of web data using a cluster with 10k cores and 20 petabytes of disk space. This illustrates what is now possible to do in production settings with Hadoop.
  • Jimmy Lin (University of MD) and Christophe Bisciglia (Google) talked about natural language processing work going on at UMD and other universities. I got a chance to shake Christophe’s hand during the happy hour and to thank him for revolutionizing my life (see My Hadoop Odyssey, below).

I had a fantastic opportunity to sit on the futures panel with leaders of the Hadoop community (Sameer Paranjpye, Sanjay Radia, Owen O’Malley (all Y!) and Chad Walters (Powerset)) to introduce the new Mahout project while they presented the future directions of Hadoop and HBase. The panel gave me an outstanding soapbox, generated a lot of interest in machine learning applications, and opened up several great opportunities for follow-up discussions with people from the greater Hadoop community.

Wednesday, March 26, 2008

What is Mahout?

Around the end of January I saw an interesting post on the Hadoop users list announcing the creation of a new sub-project called Mahout under the Apache Lucene project. I decided this would be a good place to continue my Hadoop odyssey.

Using cloud computing technologies such as EC2, Lucene, Nutch, Hadoop, Pig and HBase, it is now possible for even small companies to perform analytics over the entire World Wide Web. The emerging challenge is to develop improved analytics that can separate relevant information from spam, learn from previous experience and organize information in ever more meaningful ways.

In recent years a rather large community of researchers has addressed the problem of extracting useful intelligence from the Web. Whether it is classifying documents into categories, clustering them to form groups that make sense to users or ranking them by relevancy given some query, these methods fall under the broad category of machine learning algorithms. Unfortunately, most of the available algorithms are proprietary, carry restrictive licenses or do not scale to massive amounts of information.

The focus of the Mahout project is to develop commercially friendly, scalable machine learning algorithms such as classification, clustering, regression and dimension reduction under the Apache brand and on top of Hadoop. Its initial goal is to build out the ten machine learning libraries detailed in Map-Reduce for Machine Learning on Multicore, by Chu, Kim, Lin, Yu, Bradski, Ng & Olukotun of Stanford University. Though the project is only in its second month, we have an active and growing community with initial submissions in the areas of clustering, classification and matrix operations.
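
The key idea in that paper is that many learning algorithms can be rewritten in a "summation form": mappers compute partial statistics over their shards of the data, and a reducer simply adds them up. Here is a tiny conceptual sketch of that idea in plain Java. It is not Mahout code, and the feature mean is just a stand-in for a real algorithm's sufficient statistics:

    import java.util.List;

    // Conceptual sketch of the "summation form" from Chu et al.: each mapper
    // summarizes its shard, and the reducer combines the partial results.
    public class SummationForm {

      // Partial statistic a mapper would emit for its shard of the data.
      record Partial(double sum, long count) {}

      // "Map": summarize one shard of feature values independently.
      static Partial mapShard(List<Double> shard) {
        double sum = 0;
        for (double x : shard) sum += x;
        return new Partial(sum, shard.size());
      }

      // "Reduce": combine the partials; addition is associative, so this
      // parallelizes across any number of shards.
      static double reduce(List<Partial> partials) {
        double sum = 0;
        long count = 0;
        for (Partial p : partials) { sum += p.sum(); count += p.count(); }
        return sum / count;
      }

      public static void main(String[] args) {
        List<Partial> partials = List.of(
            mapShard(List.of(1.0, 2.0, 3.0)),   // shard on machine 1
            mapShard(List.of(4.0, 5.0)));       // shard on machine 2
        System.out.println(reduce(partials));   // 3.0, same as the serial mean
      }
    }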

The Mahout team chose this name for the project out of admiration and respect for the work of the Hadoop project, whose logo is an elephant. According to Wikipedia, “A mahout is a person who drives an elephant”. It goes on to say that the “Sanskrit language distinguishes three types [of mahouts]: Reghawan, who use love to control their elephants, Yukthiman, who use ingenuity to outsmart them and Balwan, who control elephants with cruelty”. We intend to practice only the first two approaches and welcome individuals with similar values who would like to contribute to the project.

My Hadoop Odyssey

It all started on a lazy Sunday afternoon back in December. The new issue of Business Week had just arrived, and it had an interesting-looking cover about a new technology called "cloud computing". The long-haired guy on the cover looked somewhat out of place for the usual BW fare, but I had never heard of the technology, so it piqued my interest. Reading through the article, it turned out to be about massively parallel computing platforms and their emerging impact on the business world of web-scale information processing. The guy on the cover, Christophe Bisciglia, was a young engineer at Google who had become their cloud computing guru and was working with some universities to establish curricula for teaching students about this new technology.

As a necessary ingredient of their web search business, Google had developed a way to use massive arrays of general-purpose computers as a single computational platform. They had racks and racks full of off-the-shelf PCs, each with its own memory and disk drive. Using proprietary technology based upon an old functional programming technique called Map/Reduce, they were able to store massive amounts of web search data redundantly on these computer arrays and run jobs over a database consisting of the entire World Wide Web. I'd always wondered how they did it.

The article went on to mention Hadoop, an open source version of this technology being developed by Yahoo!, IBM, Amazon and other companies to make it available under the Apache Software Foundation's flexible licensing terms. Though this effort competed with Google's and was behind it on the learning curve, it provided an open platform for training young engineers to think in terms of these massively parallel computations.

It also provided me with a window into this new technology. That evening, I downloaded a copy and started reading the documentation. By midnight, I had a 1-node Hadoop cloud running on my laptop and was running some of the example jobs. The next day I went into my office at CollabNet and commandeered a few of my colleagues' Linux boxes to build a 12-node cloud that had over a terabyte of storage and a dozen CPUs. Then I went looking for some data to munch with it.

CollabNet is in the globally distributed software development business, not the web search business, so about the only large source of data we had was the logs from our Apache web servers. I got ops to give me about 6 GB of logs and started writing a program to extract some usage information from them. In short order, I had my first map/reduce application tested using the very fine Eclipse plugin provided by IBM. I ran it on a single-node cluster against 5 months of logs, and the program took about 120 minutes to complete. Then I launched it on my 12-node cloud and it took only 12 minutes - almost linear scalability with cluster size. This really cemented my interest.
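
For flavor, here is a minimal sketch of that kind of log analysis as a Hadoop map/reduce job. It is not the program I wrote; the class names and the field index are illustrative, it assumes the common Apache log format, and it uses the newer org.apache.hadoop.mapreduce API rather than the one current at the time. It simply counts HTTP status codes:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LogStatusCount {

      // Mapper: emit (HTTP status code, 1) for each access-log line.
      public static class StatusMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text status = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          // Common Log Format: host ident user [date tz] "GET /path HTTP/1.1" status bytes
          String[] fields = value.toString().split(" ");
          if (fields.length > 8) {
            status.set(fields[8]);          // status code field in that layout
            context.write(status, ONE);
          }
        }
      }

      // Reducer (also usable as a combiner): sum the counts per status code.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log status count");
        job.setJarByClass(LogStatusCount.class);
        job.setMapperClass(StatusMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }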

There was one aspect of CollabNet's business that, I felt, might benefit from this technology. CUBiT, a nascent product designed to manage pools of developer boxes, allows engineers to check a machine out from a pool, install a particular operating system profile on it, check out a particular version of our software, build it and use it for testing. Using the CUBiT user interface, I was able to see that we had literally hundreds of Linux boxes in use around the company. I was also able to see that most of them were only about 2-5% utilized, sitting there stirring and warming the air in their machine rooms most of the time just waiting for engineers to need them.

We were sitting on top of a massive supercomputer and did not even realize it! How many of our customers had similar environments? Our customers included major Fortune 1000 corporations. Probably ours was one of the smallest latent clouds around. What if we bundled Hadoop into CUBiT? It would be totally opportunistic as a product feature but it would enable our customers to develop and run jobs that they never even dreamed were possible, right in their own laboratories, for free. "Buy CUBiT, get a free supercomputer" became my mantra.