Wednesday, March 26, 2008

My Hadoop Odyssey

It all started on a lazy Sunday afternoon back in December. The new issue of Business Week had just arrived, with an interesting-looking cover about a new technology called "cloud computing". The long-haired guy on the cover looked somewhat out of place for the usual BW fare, but I had never heard of the technology and so it piqued my interest. Reading through the article, it turned out to be about massively parallel computing platforms and their emerging impact on the business world of web-scale information processing. The guy on the cover, Christophe Bisciglia, was a young engineer at Google who had become their cloud computing guru and was working with some universities to establish curricula for teaching students about this new technology.

As a necessary ingredient of their web search business, Google had developed a way to use massive arrays of general-purpose computers as a single computational platform. They had racks and racks full of off-the-shelf PCs, each with its own memory and disk drive. Using proprietary technology based upon an old functional programming technique called Map/Reduce, they were able to store massive amounts of web search data redundantly on these computer arrays and run jobs over a database consisting of the entire world wide web. I'd always wondered how they did it.
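The core idea is simple enough to sketch in a few lines of plain Python. This is a conceptual illustration of the Map/Reduce model only, not Hadoop's or Google's actual API: a map step turns each record into key/value pairs, the framework groups pairs by key, and a reduce step folds each group down to a result.

```python
from collections import defaultdict

def map_phase(record):
    # Map step: emit (word, 1) for every word in an input line.
    for word in record.split():
        yield (word, 1)

def reduce_phase(key, values):
    # Reduce step: sum the counts for a single word.
    return (key, sum(values))

def map_reduce(records):
    # The "shuffle": group mapped values by key. In Hadoop, the
    # map and reduce steps run in parallel across many machines.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(k, vs) for k, vs in groups.items())

counts = map_reduce(["the web the cloud", "the web"])
# counts is {"the": 3, "web": 2, "cloud": 1}
```

Because the map step sees each record independently and the reduce step sees each key independently, both can be spread across as many machines as you have, which is what lets a job scale with the size of the cluster.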

The article went on to mention Hadoop, an open source version of this technology being developed by Yahoo!, IBM, Amazon and other companies to make it available under the Apache Software Foundation's flexible licensing terms. Though this effort competed with Google's work and trailed it on the learning curve, it provided an open platform to train young engineers to think in terms of these massively parallel computations.

It also provided me with a window into this new technology. That evening, I downloaded a copy and started reading the documentation. By midnight, I had a 1-node Hadoop cloud running on my laptop and was running some of the example jobs. The next day I went into my office at CollabNet and commandeered a few of my colleagues' Linux boxes to build a 12-node cloud that had over a terabyte of storage and a dozen CPUs. Then I went looking for some data to munch with it.

CollabNet's in the globally-distributed software development business, not the web search business, and so about the only large source of data we had was the logs from our Apache web servers. I got ops to give me about 6 GB of logs and started writing a program to extract some usage information from them. In short order, I had my first map/reduce application tested using the very fine Eclipse plugin provided by IBM. I ran it on a single-node cluster against 5 months of logs and the program took about 120 minutes to complete. Then I launched it on my 12-node cloud and it took only 12 minutes - almost linear scalability with cluster size. This really cemented my interest.
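To give a flavor of that kind of job, here is a hypothetical sketch in plain Python of extracting one simple usage statistic - hits per URL path - from Apache access log lines. The log format and the statistic are my illustrative assumptions, not the actual program described above, and a real Hadoop job would express the same two phases through Hadoop's Mapper and Reducer classes.

```python
import re
from collections import Counter

# Assumed: Apache common-log-style lines with a quoted request field.
LOG_PATTERN = re.compile(r'"(?:GET|POST) (\S+)')

def map_line(line):
    # Map phase: emit (path, 1) for each request found in a log line.
    m = LOG_PATTERN.search(line)
    if m:
        yield (m.group(1), 1)

def reduce_counts(pairs):
    # Reduce phase: sum the counts per path.
    totals = Counter()
    for path, n in pairs:
        totals[path] += n
    return totals

logs = [
    '10.0.0.1 - - [01/Dec/2007] "GET /svn/repo HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Dec/2007] "GET /svn/repo HTTP/1.1" 200 512',
    '10.0.0.3 - - [01/Dec/2007] "POST /login HTTP/1.1" 200 128',
]
hits = reduce_counts(p for line in logs for p in map_line(line))
# hits["/svn/repo"] == 2, hits["/login"] == 1
```

On a cluster, each node would run the map phase over its own slice of the 6 GB of logs, which is why adding nodes cut the runtime almost proportionally.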

There was one aspect of CollabNet's business that, I felt, might benefit from this technology. CUBiT, a nascent product designed to manage pools of developer boxes, allows engineers to check a machine out from a pool, install a particular operating system profile on it, check out a particular version of our software, build it and use it for testing. Using the CUBiT user interface, I was able to see that we had literally hundreds of Linux boxes in use around the company. I was also able to see that most of them were only about 2-5% utilized, sitting there whirring and warming the air in their machine rooms, just waiting for engineers to need them.

We were sitting on top of a massive supercomputer and did not even realize it! How many of our customers had similar environments? Our customers included major Fortune 1000 corporations. Probably ours was one of the smallest latent clouds around. What if we bundled Hadoop into CUBiT? It would be totally opportunistic as a product feature but it would enable our customers to develop and run jobs that they never even dreamed were possible, right in their own laboratories, for free. "Buy CUBiT, get a free supercomputer" became my mantra.
