Thursday, August 21, 2008

Why I am a Northwest Airlines Promoter

Hi Honey,

I thought you might be interested in this email chain from NWA. I think I will also post it to my blog as it is an example of the excellent customer experience they have delivered. I do not know if this is related, or merely policy due to my new status in their elite flier program, but when I went to check in for my flight home they had upgraded me to first class without my taking any action whatsoever. Now, that made my day.

I'm a solid NWA promoter (netpromoter.com), of course. See you tomorrow,
Jeff

---------- Forwarded message ----------
From: Northwest Airlines
Date: Wed, Aug 20, 2008 at 5:41 AM
Subject: Re: Apology (KMM17979656V36086L0KM)
To: Jeff Eastman


Dear Mr. Eastman,

RE: Case Number 6186336

You are welcome.

Again, we apologize for the flight irregularity and look forward to
serving you on a future Northwest flight.

Sincerely,

Sarah Sanders
Customer Care
Northwest/KLM Airlines



Original Message Follows:
-------------------------

Hi Cassie,

Thank you for your cordial letter acknowledging the technical problem you had on August 6th. As I understood from your ground staff, the delay was caused by the boarding ramp damaging the door of the aircraft due to an operator error. Your airport staff was able to book me on an alternative flight (in first class) and I got home only two hours late. I do not imagine you have this sort of problem very often, and your proactive handling of the incident leaves me a strong Northwest promoter in spite of these incidents. Your overall record is still excellent and I appreciate the extra points you have offered.

Thanks,
Jeff Eastman


On Tue, Aug 19, 2008 at 9:28 AM, Northwest Airlines <Northwest.Airlines@nwa.com> wrote:

> Dear Mr. Eastman,
>
> RE: Case Number 7143682
>
> On behalf of all the employees at Northwest Airlines, I would like to
> extend a sincere apology for the flight irregularity you experienced on
> Flight 5858 on August 6. Travelers expect us to provide dependable and
> reliable service and we failed on this occasion.
>
> Furthermore, we regret to learn you experienced a previous interruption
> on Flight 235 on April 25.
>
> As a gesture of apology and in recognition of your Silver Elite status,
> I have added 4000 WorldPerks bonus miles. Please allow 3-5 business
> days for the miles to appear in your account ****3265.
>
> My colleagues and I pledge to you that we are dedicated to providing
> good service. Unfortunately, a reality in this industry is that there
> will be times when we are forced to delay, cancel, or divert flights.
> Thank you for your support and for flying Northwest.
>
> Sincerely,
>
> Cassie Steidler
> Manager, Customer Care
> Northwest/KLM Airlines
>
>

Tuesday, June 10, 2008

WOM: Road Warrior War Stories

I had the occasion to fly via Continental Airlines to Erie, PA yesterday and have three war stories to recount. The first is a huge missed opportunity that leaves me a Continental detractor. The other two are positive experiences.

My flight out of SFO was delayed for mysterious "air traffic control delays", but only for 15 minutes, and we departed under clear skies. When we arrived over Cleveland it was evident from the turbulence that there was "weather" in the area and, sure enough, we landed almost 45 minutes late. I had a close connection on another Continental jet to Erie and - when I checked with the gate attendant upon my arrival - was told they had not yet closed the door and that if I hurried I could make my connection. Needless to say, I hurried, but when I arrived the door had been sealed anyway. No amount of cajoling could persuade the gate attendants to open it, notwithstanding the fact that the airport itself was locked down because the local thunderstorm was now dumping buckets of rain on everything. I went to the customer service desk and was told - basically - "tough shit, you missed your flight, it was not our problem, so we don't have to help you". I got a coupon with an 800 number to call if I actually wanted to overnight in some motel nearby on my nickel at a "discount". Flying just ain't what it used to be. You already knew that. Other airlines have actually held the doors for a minute or two so late-connecting passengers could make it. Not Continental.

Needless to say, I did not wish to take that option, so I took the tedious bus trip to the rental car terminal about five miles away to get a car to drive the last 100 miles. Unfortunately, all of the rental cars at all of the agencies had already been booked and there were none to be had. The guy at the Enterprise desk made a helpful phone call and determined that I could take a Greyhound bus to Erie if only I could get to the bus terminal in downtown Cleveland and wait for up to an hour. At 10pm that was not a really attractive option, but there were no good alternatives at this point, so I finally went back to the Hertz desk, where I am a gold member. There, a wonderful woman who empathized with my plight said her supervisor was looking for more cars and was soon able to rent one to me. A two-hour drive later I was checking into the Sheraton Hotel on the waterfront in Erie.

The Sheraton Bayfront Hotel in Erie is a new hotel and is not even in 411 directory service yet. It has lovely rooms that are well-appointed and comfortable. It is a most agreeable destination at midnight on a stormy northeast night on Lake Erie, and it has a nice little restaurant. This evening I had dinner there and had one of the nicer presentations of a salmon Caesar salad that I have ever experienced. In Erie, Pennsylvania, go figure. I will be back to the hotel and to the restaurant for another meal.

I guess the epiphany for me in all of this is that the whole Net Promoter question is most relevant when we relate our personal war stories to our friends via word of mouth. If we have good experiences with a brand we tend to say nice things to our friends about it. When we have some bad experiences, we tend to say bad things. When we have neutral experiences we say nothing at all. And blogging allows everybody who can use Google or Yahoo! to become my virtual friends. Maybe I'm just venting but it feels good.

Thursday, June 5, 2008

Windward Portlets in the Clouds

I just finished uploading the latest image of Liferay 5.0.2 to our new prototype website, which contains two new portlets I just completed in addition to the hundred-plus already in Liferay. Touchpoint portlets allow me to build and host community web sites with embedded Net Promoter data capture at various touch points with their users. When community members answer one of the Touchpoint questions (on a scale of 0-10, of course), their score and comments are saved in the database. The comments are also posted automatically to one of three message boards, depending upon their classification as a promoter, detractor or neutral respondent.
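The scoring rule itself is dead simple. Here is a minimal sketch of the classification the portlet applies, assuming the standard Net Promoter bands (9-10 promoter, 7-8 neutral, 0-6 detractor); the class and method names are mine, not Liferay's:

```java
/** A minimal sketch of the Touchpoint scoring rule (names are hypothetical). */
public final class TouchpointScorer {

    public enum Classification { PROMOTER, NEUTRAL, DETRACTOR }

    /** Classify a 0-10 response using the standard Net Promoter bands. */
    public static Classification classify(int score) {
        if (score < 0 || score > 10) {
            throw new IllegalArgumentException("score must be 0-10: " + score);
        }
        if (score >= 9) {
            return Classification.PROMOTER;   // 9-10: post to the promoter board
        }
        if (score >= 7) {
            return Classification.NEUTRAL;    // 7-8: post to the neutral board
        }
        return Classification.DETRACTOR;      // 0-6: post to the detractor board
    }
}
```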

The other new portlet is a Touchpoint Admin portlet that lets community administrators create new questions and monitor their results using tabular and charting formats. Of course, this is a baby step into the Net Promoter world. Any reasonably large community would generate a huge volume of messages and so my next project will be to work on a text analytics portlet to allow these comments to be filtered and organized in meaningful ways. This will draw me back into Mahout, where I'm exposed to some pretty heavy hitters in this field.

Of course, my site is also Hadoop-enabled. I have not yet figured out how to utilize cloud computing clusters for this task but I'm working on it. Maybe I'll build a portlet to administer my cloud first.

Oh, I'm not quite ready to go live with the new site, so stay tuned to Windward's current site for news.

Tuesday, May 20, 2008

Windward in the Clouds

Amazon is in the vanguard of the new cloud computing marketplace and, while I've been EC2-aware for months, until recently I had not actually gotten my hands on it. That changed about a week ago when I decided to see if I could bring up a Hadoop cluster. My longer-term goal is to run some scalability tests of the Mahout clustering code, and since I lost my little cluster when I left CollabNet, I need a replacement.

The economics are really super: ten cents an hour for a box in one of their datacenters and fifteen cents per gigabyte per month for storage. A rather large run of an hour on twenty boxes costs $2, plus the storage costs. I figured I can afford that, so why not see what it takes?

The process was pretty simple. After signing up with Amazon Web Services and downloading their toolkit, I followed their excellent getting started tutorial and pretty soon had a Fedora 8 box running under my control. Getting Hadoop installed required a bit more work as the box comes with nothing but Linux on it. A couple of 'yum' installs later the Java environment was running and Hadoop was installed. I brought up a single node Hadoop cluster and then decided to wait for Hadoop 0.17 to release as it has some DNS optimizations that make running on EC2 simpler.

Since I had a little experience, and since Windward is in dire need of website rebranding improvements, I decided to bring up a copy of the Liferay Enterprise Portal to see what that would be like. That required running their installation as well as installing MySQL. There were some problems with the Liferay 5.0.1 RC script, but all in all the process has been moderately easy.

Today I spent most of the day customizing the site with some graphics and an initial page layout. I still need to finish the rebranding and finally point my existing website URL to it, but Windward is now living in the clouds. Woo hoo.

Thursday, April 17, 2008

I Encounter Scary Math

My first programming experiences on the Mahout project were pretty straightforward. I was working on a problem that seemed to need some clustering - grouping data that is "similar". I listened to a couple of excellent Google Tutorials that explained how to use Map/Reduce to implement Canopy Clustering. This was a small leap from my previous experiences and I was able to implement it in a few days. In the process, I learned about k-Means Clustering since canopy clusters are often used as input to k-means. Now, the math was starting to slow me down but the algorithm was still pretty simple and so I was able to make good progress.
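For anyone curious, the essence of canopy clustering fits in a few lines. Here is a plain, sequential sketch of the idea - not the Mahout map/reduce version, and the names are mine - using two distance thresholds T1 > T2:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

/** A sequential sketch of canopy clustering (not the Mahout map/reduce code). */
public class CanopySketch {

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** T1 > T2: within T1 a point joins the canopy; within T2 it is also retired. */
    static List<List<double[]>> canopies(List<double[]> points, double t1, double t2) {
        List<List<double[]>> result = new ArrayList<List<double[]>>();
        LinkedList<double[]> remaining = new LinkedList<double[]>(points);
        while (!remaining.isEmpty()) {
            double[] center = remaining.removeFirst();   // next unretired point seeds a canopy
            List<double[]> canopy = new ArrayList<double[]>();
            canopy.add(center);
            for (Iterator<double[]> it = remaining.iterator(); it.hasNext();) {
                double[] p = it.next();
                double d = distance(center, p);
                if (d < t1) {
                    canopy.add(p);    // cheaply "close enough" to belong to this canopy
                }
                if (d < t2) {
                    it.remove();      // so close it cannot seed another canopy
                }
            }
            result.add(canopy);
        }
        return result;
    }
}
```

The canopy centers then make natural initial centroids for k-means, which is why the two are so often paired.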

Then came mean shift clustering. A student had posted an email expressing interest in the algorithm and, in a later correspondence, included a reference to a comprehensive paper on it. The math was scary, completely opaque to me, and filled with new statistical terms. I must have read that paper a dozen times before it dawned upon me what it was doing. It's not so much that I could not interpret the notation; I just have a hard time mapping it into the real world so that I can visualize its intent. One night I had a vision of hydrogen atoms floating in interstellar space, weakly attracted to each other and slowly forming into clumps - gas clusters - that ultimately became stars. Somehow, that vision morphed into vast clouds of canopy clusters moving and merging together, and an implementation was born. I still cannot prove it is correct, but it worked on a test dataset and it felt right, so I committed it.
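To make the "gas cluster" intuition concrete, here is a toy sketch of one flat-kernel mean shift pass - my own simplification, not the committed Mahout code, and it assumes the distance() helper from the canopy sketch above. Every point moves to the centroid of its neighbors within a window; repeating this until nothing moves makes the points pile up at the modes of the density, and points that end up together form a cluster.

```java
/**
 * One flat-kernel mean shift pass (a toy sketch, not the Mahout implementation).
 * Every point is replaced by the centroid of its neighbors within the window.
 * The caller iterates: while (shiftOnce(points, window, epsilon)) { }
 */
static boolean shiftOnce(double[][] points, double window, double epsilon) {
    boolean moved = false;
    double[][] next = new double[points.length][];
    for (int i = 0; i < points.length; i++) {
        double[] sum = new double[points[i].length];
        int count = 0;
        for (double[] p : points) {
            if (distance(points[i], p) <= window) {   // flat (uniform) kernel
                for (int k = 0; k < sum.length; k++) {
                    sum[k] += p[k];
                }
                count++;
            }
        }
        next[i] = new double[sum.length];
        for (int k = 0; k < sum.length; k++) {
            next[i][k] = sum[k] / count;              // count >= 1: a point always sees itself
        }
        if (distance(points[i], next[i]) > epsilon) {
            moved = true;
        }
    }
    for (int i = 0; i < points.length; i++) {
        points[i] = next[i];
    }
    return moved;
}
```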

Then, Ted Dunning, an active contributor to the Hadoop and Mahout mailing lists, introduced me to Dirichlet Process Clustering. Unlike the other clustering algorithms which assign each data point to a single, "best" cluster, Dirichlet allows each point to be assigned to multiple clusters - each with an associated probability. This is much more realistic, but it makes the math really, really complicated and I'm still struggling to map the notation onto reality. Each time I read all the papers it gets a little bit clearer, but I'm hoping for another vision. So far, the best I can do right now is a variation of the Chinese Restaurant Process.

Imagine a very large Chinese restaurant, with (infinitely) many tables. Each table can seat (infinitely) many patrons but only serves a single set of dishes to all of them. The first patron to sit at an empty table orders exactly what she likes from the menu for that table. When a new patron enters the restaurant, he surveys all of the tables. Each will have some items he likes and some that he does not. By comparing his likes and dislikes with the menu on each table, we can calculate the probability that he will sit at each table as well as the probability that he will choose an empty table. If the tables represent the clusters and the patrons represent the data points then some clusters will be more likely than others to contain the point. Of course, the probabilities must all add up to 1. Maybe I can get Ted to comment on this posting and my dumbed-down version of DPC.
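For reference, the textbook version of the process ignores the menus entirely: a newcomer joins a table with probability proportional to how many patrons already sit there, and starts a new table with probability proportional to a concentration parameter alpha. Here is a small sampler of that classic form (my own sketch, not Mahout code); my menu-matching variant above would simply replace the table sizes with preference scores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Textbook Chinese Restaurant Process sampler (a sketch for illustration only). */
public class ChineseRestaurant {

    /** Returns, for each patron in arrival order, the table index they chose. */
    public static List<Integer> seat(int patrons, double alpha, Random rnd) {
        List<Integer> tableSizes = new ArrayList<Integer>();
        List<Integer> assignments = new ArrayList<Integer>();
        for (int n = 0; n < patrons; n++) {
            double total = n + alpha;            // normalizer: the probabilities sum to 1
            double draw = rnd.nextDouble() * total;
            int table = tableSizes.size();       // default: a brand new table
            for (int t = 0; t < tableSizes.size(); t++) {
                draw -= tableSizes.get(t);       // P(table t) = size_t / (n + alpha)
                if (draw < 0) {
                    table = t;
                    break;
                }
            }
            if (table == tableSizes.size()) {
                tableSizes.add(1);               // P(new table) = alpha / (n + alpha)
            } else {
                tableSizes.set(table, tableSizes.get(table) + 1);
            }
            assignments.add(table);
        }
        return assignments;
    }
}
```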

Friday, April 11, 2008

Word of Mouth Marketing: An Expedia Experience

My wife Deborah is a marketing professional who has taught me a lot about the emerging power of word-of-mouth marketing. We have taken turns running our consulting and technology company Windward Solutions while the other has a day job. Since April it has been my turn to be 'Jeff on a jet'.

I booked my first trip through Expedia.com. It was a package deal that included airfare and a hotel booking with Embassy Suites. I arrived at my destination and found that the hotel had no record of my reservation. Fortunately, they had a room, so I checked in. When it came time to check out, however, I was faced with a second bill for the same reservation. This was an order fulfillment problem between Expedia and Embassy Suites, not my problem.

I tried to speak with a person at Expedia about this double-payment problem, but unfortunately American Airlines had canceled thousands of flights and the wait was interminable. I sent an email to the Expedia support desk explaining the situation and providing the details. I was told that they could not handle my complaint and that I needed to wait forever in their phone queue to be helped. I'm not inclined to do that, since this is not my problem, so I called my credit card company and contested the charge.

By the way, they had charged two transactions: one for the airfare and another for the hotel. The hotel charge was $133 more than the actual charge on my bill - some package deal.

What does this have to do with word-of-mouth marketing? Well, I know that there are many people who routinely search the blogosphere for what customers are saying about their experiences with corporations. There are also companies, such as Satmetrix, Biz360 and others (help me out here Deb), that do this for a living. So, I'm sending this little anecdote off into the blogosphere in the fond hope that it will show up as a black mark on Expedia's record. In Net Promoter terms, I am now a detractor. Sorry Expedia.

But I'm fair. If they can resolve this problem quickly and to my satisfaction, then I will post that result too and perhaps consider them again when I travel on business. I'm not going to call them, however; I've given them my cell phone number so they can call me. I usually answer it right away and there is no annoying phone triage for my callers, so they will get to a real person immediately.

What does this have to do with cloud computing? What if I could boil the Internet and distill out for you what people are saying about your company and your products?

Saturday, March 29, 2008

How Big is a Petabyte, Anyway?

I woke up a bit early this morning wondering how to describe a petabyte. I can easily count to a hundred and do math in that range. In the thousands and above, I resort to scientific notation. A petabyte is 1,000,000,000,000,000 bytes, or 10^15 bytes. How do I convey that incredible size in a comprehensible way?

Well, most of us nowadays have a gigabyte of memory in our laptops. A thousand laptops is a terabyte and a million laptops is a petabyte. What if you could make a million laptops all work together on the same problem? What could they do? What would you want them to do?

There are about two hundred billion stars in our Milky Way Galaxy, or 2x10^11 stars. Five thousand Milky Way galaxies would contain 10^15 stars, a petastar.

If you started typing and typed a petabyte of data, it would show on the screen as a very long string. How long would that string be? Well, if you typed 5 characters per centimeter, then 10^15 characters would be 2x10^14 centimeters long, or two billion kilometers.

OK, that's still pretty incomprehensible. Let's see how long the beam of a flashlight would take to go from one end to the other. Light travels at 3x10^10 cm/sec, so it would need about 6,666 seconds to traverse the petabyte string, almost two hours.

This is still hard to grasp. Consider instead getting on a jet and flying at 1000 km/hr over the string. The trip would take 2x10^6 hours, roughly 230 years. Better fly first class.
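For the skeptical, here is the back-of-the-envelope arithmetic from the last few paragraphs recomputed in a few lines of Java (a throwaway sketch, nothing more):

```java
/** Back-of-the-envelope petabyte numbers from the post, recomputed in code. */
public class PetabyteMath {
    public static void main(String[] args) {
        double petabyte = 1e15;                       // characters
        double cmPerChar = 1.0 / 5.0;                 // 5 characters per centimeter
        double lengthCm = petabyte * cmPerChar;       // 2e14 cm
        double lengthKm = lengthCm / 1e5;             // 2e9 km

        double lightSeconds = lengthCm / 3e10;        // ~6,666 s, just under two hours
        double jetHours = lengthKm / 1000.0;          // 2e6 hours at 1000 km/hr
        double jetYears = jetHours / (24 * 365.25);   // ~230 years

        System.out.printf("string length: %.0f km%n", lengthKm);
        System.out.printf("flashlight beam: %.0f s (%.1f hours)%n", lightSeconds, lightSeconds / 3600);
        System.out.printf("jet at 1000 km/hr: %.0f hours (~%.0f years)%n", jetHours, jetYears);
    }
}
```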

Of course, you could not type a petabyte yourself in your lifetime, nor could you and all of your friends. But the Web is perhaps a tenth of a petabyte or so right now and is still growing really fast. Lots of people are typing at the same time and computers are helping them. With Hadoop on a good sized cloud, you can run analytics over that dataset in reasonable time. What kind of questions do you want to ask?

Thursday, March 27, 2008

The First Hadoop Summit

On March 25th, I attended the first Hadoop Summit. When I got to the conference, I picked up my t-shirt and introduced myself to Ajay Anand, the Hadoop product manager and conference organizer. What had started out in the minds of the organizers as a small, local workshop had mushroomed into an overnight sensation. The original venue had space for perhaps a hundred participants and was booked full within a day of opening registration. After finding a bigger room at Yahoo!, which was also immediately filled, they partnered with Amazon Web Services to move the venue to the Network Meeting Center in Santa Clara, CA. By the time I arrived, that venue was filled to standing room only. I went into the auditorium and found a seat next to a gentleman who heads Emerging Technology at a Korean company. He told me he has a 200-node cluster and is interested in new marketing applications that are now possible using this technology. There are lots of similar business opportunities awaiting leading-edge adopters of Hadoop.

Ajay opened the conference and introduced Doug Cutting and Eric Baldeschwieler, who gave a historical overview of Hadoop's evolution up to where it is today, in production at Y!. Hadoop began its life as part of the Apache Lucene Nutch project, which needed a distributed file system to store the web pages returned by its crawlers. They were aware of the work being done at Google and wanted to exploit the Map/Reduce paradigm to run computations over these very large data sets. The project snowballed with the support of an active, worldwide, open source community abetted by Yahoo! investments and has recently become a top-level Apache project in its own right.

Five speakers followed this introduction, each describing work being done on top of the Hadoop platform.

  • Chris Olston (Y!) gave a nice introduction to Pig, which I have explored a bit and have found to be quite powerful. "Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets."
  • Kevin Beyer (IBM) gave a talk on JAQL, a new, more SQL-like query language for processing JSON data on top of Hadoop.
  • Michael Isard (Microsoft Research) described DryadLINQ, a highly parallel environment for developing computations on a cloud computing infrastructure. He showed that map/reduce computations can be phrased quite simply using their language. The reaction of several people I spoke with was, unfortunately, "too bad it is buried inside of Microsoft's platform".
  • Andy Konwinski (UC Berkeley) talked about the X-Trace monitoring framework they had embedded inside Hadoop, adding only about 500 lines of code. This seems potentially useful in understanding the actual behavior of M/R jobs, and they promise to clean it up and submit it as a patch.
  • Ben Reed (Y!) discussed Zookeeper, a hierarchical namespace directory service that can be used for coordinating and communicating between multiple user jobs on Hadoop.
After lunch Michael Stack (Powerset) gave an introduction to HBase, a scalable, robust, column-oriented database that is built upon the Hadoop distributed file system. The project is in its second year and is based upon BigTable, another Google technology. It stores very large tables whose cells can be accessed by row primary key, column name and a timestamp. I've not yet experimented with HBase, but will likely need to utilize it in my Mahout work for storing and manipulating very large vectors and matrices. Afterwards, Brian Duxbury (Rapleaf) described how HBase and Hadoop are used to search the Web for information about people's reputations, gleaned from various online sources. How can I influence that score?

There were several additional talks that addressed application level work being done on top of the Hadoop platform:

  • Jinesh Varia (Amazon) talked about how they deploy GrepTheWeb jobs on Hadoop clusters that are materialized on EC2 to run and then vanish when they are finished. This is an example of the kind of technology that is now available to anybody with a wallet and a good map/reduce algorithm that they want to use for generating business value.
  • Steve Schlosser (Intel) and David O’Hallaron (CMU) talked about building ground models of Southern California, using Hadoop in a novel processing application of seismic data.
  • Mike Haley (Autodesk) talked about how they are using classification algorithms and Hadoop to correlate the product catalogs of building parts suppliers into their graphical component library that is used for CAD.
  • Christian Kunz (Y!) described their recently-announced production use of Hadoop. He showed some very big numbers and impressive improvements over their previous technology in terms of scale, reliability, manageability and speed. To generate their web search index, they routinely run 100k map and 10k reduce jobs over hundreds of terabytes of web data using a cluster with 10k cores and 20 petabytes of disk space. This illustrates what is now possible to do in production settings with Hadoop.
  • Jimmy Lin (University of MD) and Christophe Bisciglia (Google) talked about natural language processing work going on at UMD and other universities. I got a chance to shake Christophe’s hand during the happy hour and to thank him for revolutionizing my life (see My Hadoop Odyssey, below).

I had a fantastic opportunity to sit on the futures panel with leaders of the Hadoop community (Sameer Paranjpye, Sanjay Radia, Owen O’Malley (all Y!) and Chad Walters (Powerset)) to introduce the new Mahout project while they presented the future directions of Hadoop and Hbase. The panel gave me an outstanding soapbox, generated a lot of interest in machine learning applications and several great opportunities for followup discussions with people from the greater Hadoop community.

Wednesday, March 26, 2008

What is Mahout?

Around the end of January I saw an interesting post on the Hadoop users list announcing the creation of a new sub-project called Mahout under the Apache Lucene project. I decided this would be a good place to continue my Hadoop odyssey.

Using cloud computing technologies such as EC2, Lucene, Nutch, Hadoop, Pig and Hbase, it is now possible for even small companies to perform analytics over the entire World Wide Web. The emerging challenge is to develop improved analytics that can separate relevant information from spam, learn from previous experience and organize information in ever more meaningful ways.

In recent years a rather large community of researchers has addressed the problem of extracting useful intelligence from the Web. Whether it is classifying documents into categories, clustering them to form groups that make sense to users, or ranking them by relevance given some query, these methods fall under the broad category of machine learning algorithms. Unfortunately, most of the available algorithms are either proprietary, under restrictive licenses, or do not scale to massive amounts of information.

The focus of the Mahout project is to develop commercially-friendly, scalable machine learning algorithms such as classification, clustering, regression and dimension reduction under the Apache brand and on top of Hadoop. Its initial areas of focus are to build out the ten machine learning libraries detailed in Map-Reduce for Machine Learning on Multicore, by Chu, Kim, Liu, Yu, Bradski, Ng & Olukotun of Stanford University. Though the project is only in its second month, we have an active and growing community with initial submissions in the areas of clustering, classification and matrix operations.

The Mahout team chose this name for the project out of admiration and respect for work of the Hadoop project, whose logo is that of an elephant. According to Wikipedia, “A mahout is a person who drives an elephant”. It goes on to say that the “Sanskrit language distinguishes three types [of mahouts]: Reghawan, who use love to control their elephants, Yukthiman, who use ingenuity to outsmart them and Balwan, who control elephants with cruelty”. We intend to practice only in the first two categories and welcome individuals with similar values who would like to contribute to the project.

My Hadoop Odyssey

It all started on a lazy Sunday afternoon back in December. The new issue of Business Week had just arrived, and it had an interesting-looking cover about a new technology called "cloud computing". The long-haired guy on the cover looked somewhat out of place for the usual BW fare, but I had never heard of the technology, so it piqued my interest. Reading through the article, it turned out to be about massively parallel computing platforms and their emerging impact on the business world of web-scale information processing. The guy on the cover, Christophe Bisciglia, was a young engineer at Google who had become their cloud computing guru and was working with some universities to establish curricula for teaching students about this new technology.

As a necessary ingredient of their web search business, Google had developed a way to use massive arrays of general-purpose computers as a single computational platform. They had racks and racks full of off-the-shelf PCs, each with its own memory and disk drive. Using proprietary technology based upon an old functional programming technique called Map/Reduce, they were able to store massive amounts of web search data redundantly on these computer arrays and run jobs over a database consisting of the entire World Wide Web. I'd always wondered how they did it.

The article went on to mention Hadoop, an open source version of this technology that was being developed by Yahoo!, IBM, Amazon and other companies to make this technology available under the Apache Software Foundation's flexible licensing terms. Though this was a competitive effort to the Google work, it was behind them on the learning curve and it provided an open platform to train young engineers to think in terms of these massively parallel computations.

It also provided me with a window into this new technology. That evening, I downloaded a copy and started reading the documentation. By midnight, I had a 1-node Hadoop cloud running on my laptop and was running some of the example jobs. The next day I went into my office at CollabNet and commandeered a few of my colleagues' Linux boxes to build a 12-node cloud that had over a terabyte of storage and a dozen CPUs. Then I went looking for some data to munch with it.

CollabNet is in the globally-distributed software development business, not the web search business, and so about the only large sources of data we had were the logs from our Apache web servers. I got ops to give me about 6 GB of logs and started writing a program to extract some usage information from them. In short order, I had my first map/reduce application tested using the very fine Eclipse plugin provided by IBM. I ran it on a single-node cluster against 5 months of logs and the program took about 120 minutes to complete. Then I launched it on my 12-node cloud and it took only 12 minutes - almost linear scalability with cluster size. This really cemented my interest.
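The program itself was nothing fancy - essentially the classic word count, with the requesting host as the key. A rough reconstruction in the spirit of what I wrote (not the original source), against the old org.apache.hadoop.mapred API of that era, would look something like this:

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

/** Count requests per client host in Apache access logs (common log format). */
public class LogHits {

    public static class HitMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text host = new Text();

        public void map(LongWritable key, Text line,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String[] fields = line.toString().split("\\s+");
            if (fields.length > 0 && fields[0].length() > 0) {
                host.set(fields[0]);            // first field is the requesting host
                output.collect(host, ONE);
            }
        }
    }

    public static class HitReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text host, Iterator<IntWritable> counts,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (counts.hasNext()) {
                sum += counts.next().get();
            }
            output.collect(host, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(LogHits.class);
        conf.setJobName("log-hits");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(HitMapper.class);
        conf.setCombinerClass(HitReducer.class);   // safe: summing is associative
        conf.setReducerClass(HitReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```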

There was one aspect of CollabNet's business that, I felt, might benefit from this technology. CUBiT, a nascent product designed to manage pools of developer boxes, allows engineers to check a machine out from a pool, install a particular operating system profile on it, check out a particular version of our software, build it and use it for testing. Using the CUBiT user interface, I was able to see that we had literally hundreds of Linux boxes in use around the company. I was also able to see that most of them were only about 2-5% utilized, sitting there stirring and warming the air in their machine rooms most of the time just waiting for engineers to need them.

We were sitting on top of a massive supercomputer and did not even realize it! How many of our customers had similar environments? Our customers included major Fortune 1000 corporations. Probably ours was one of the smallest latent clouds around. What if we bundled Hadoop into CUBiT? It would be totally opportunistic as a product feature but it would enable our customers to develop and run jobs that they never even dreamed were possible, right in their own laboratories, for free. "Buy CUBiT, get a free supercomputer" became my mantra.