Thursday, April 17, 2008

I Encounter Scary Math

My first programming experiences on the Mahout project were pretty straightforward. I was working on a problem that seemed to need some clustering - grouping data that is "similar". I listened to a couple of excellent Google Tutorials that explained how to use Map/Reduce to implement Canopy Clustering. This was a small leap from my previous experiences and I was able to implement it in a few days. In the process, I learned about k-Means Clustering since canopy clusters are often used as input to k-means. Now, the math was starting to slow me down but the algorithm was still pretty simple and so I was able to make good progress.

Then came mean shift clustering. A student had posted an email expressing interest in the algorithm and in a later correspondence included a reference to a comprehensive paper on it. The math was scary, completely opaque to me, and filled with new statistical terms. I must have read that paper a dozen times before it dawned upon me what it was doing. It's not so much that I could not interpret the notation, I just have a hard time mapping it into the real world so that I can visualize its intent. One night I had a vision of hydrogen atoms floating in interstellar space, weakly attracted to each other and slowly forming into clumps - gas clusters - that ultimately became stars. Somehow, that vision morphed into vast clouds of canopy clusters moving and merging together and an implementation was born. I still cannot prove it is correct, but it worked on a test dataset and it felt right so I committed it.

Then, Ted Dunning, an active contributor to the Hadoop and Mahout mailing lists, introduced me to Dirichlet Process Clustering. Unlike the other clustering algorithms, which assign each data point to a single, "best" cluster, Dirichlet allows each point to be assigned to multiple clusters - each with an associated probability. This is much more realistic, but it makes the math really, really complicated and I'm still struggling to map the notation onto reality. Each time I read all the papers it gets a little bit clearer, but I'm hoping for another vision. So far, the best I can do is a variation of the Chinese Restaurant Process.

Imagine a very large Chinese restaurant, with (infinitely) many tables. Each table can seat (infinitely) many patrons but only serves a single set of dishes to all of them. The first patron to sit at an empty table orders exactly what she likes from the menu for that table. When a new patron enters the restaurant, he surveys all of the tables. Each will have some items he likes and some that he does not. By comparing his likes and dislikes with the menu on each table, we can calculate the probability that he will sit at each table as well as the probability that he will choose an empty table. If the tables represent the clusters and the patrons represent the data points then some clusters will be more likely than others to contain the point. Of course, the probabilities must all add up to 1. Maybe I can get Ted to comment on this posting and my dumbed-down version of DPC.
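The seating calculation in the restaurant story can be sketched in a few lines of Python. This is only an illustration of the analogy, not the Mahout implementation. It assumes the usual Chinese Restaurant Process weighting, in which a table's pull grows with the number of patrons already seated there and an empty table's pull is set by a concentration parameter (called `alpha` here). The "how well does this menu match my tastes" part is passed in as plain numbers. The function name, the parameter names and the example values are all made up for the sketch.

```python
def seating_probabilities(table_counts, menu_match, empty_table_match, alpha=1.0):
    """Return (probabilities for each occupied table, probability of an empty table).

    table_counts      -- number of patrons already seated at each occupied table
    menu_match        -- how well each table's menu suits the new patron
                         (a likelihood, any non-negative number)
    empty_table_match -- how well an as-yet-unordered menu is expected to suit him
    alpha             -- concentration parameter: larger values favor new tables

    All names and the weighting scheme are assumptions made for illustration.
    """
    if len(table_counts) != len(menu_match):
        raise ValueError("need one menu-match value per occupied table")

    # Unnormalized pull of each occupied table: popularity times menu match.
    weights = [n * match for n, match in zip(table_counts, menu_match)]
    # Unnormalized pull of choosing an empty table.
    empty_weight = alpha * empty_table_match

    total = sum(weights) + empty_weight
    if total <= 0:
        raise ValueError("at least one option must have positive weight")

    # Normalize so that all of the probabilities add up to 1.
    return [w / total for w in weights], empty_weight / total


if __name__ == "__main__":
    # Two occupied tables with 3 and 1 patrons; the first menu suits him better.
    probs, p_empty = seating_probabilities([3, 1], [0.5, 0.25], 0.25, alpha=1.0)
    print(probs, p_empty)
```

With these made-up numbers the popular, well-matched table takes most of the probability, while the sparsely occupied table and the empty table split the rest. Raising `alpha` shifts probability toward opening a new table, which in clustering terms means starting a new cluster.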

Friday, April 11, 2008

Word of Mouth Marketing: An Expedia Experience

My wife Deborah is a marketing professional who has taught me a lot about the emerging power of word-of-mouth marketing. We have taken turns running our consulting and technology company Windward Solutions while the other has a day job. Since April it has been my turn to be 'Jeff on a jet'.

I booked my first trip through Expedia.com. It was a package deal that included airfare and a hotel booking with Embassy Suites. I arrived at my destination and found that the hotel had no record of my reservation. Fortunately, they had a room and so I checked in. When it came time to check out, however, I was faced with a second bill to pay for the same reservation. This was an order fulfillment problem between Expedia and Embassy Suites, not my problem.

I tried to speak with a person at Expedia about this double payment problem, but unfortunately American Airlines had canceled thousands of flights and the wait was interminable. I sent an email to the Expedia support desk explaining the situation and providing the details. I was told that they could not handle my complaint and that I needed to wait forever in their phone queue to be helped. I'm not inclined to do that, since this is not my problem, so I called my credit card company and contested the charge.

By the way, they had charged two transactions: one for the airfare and another for the hotel. The hotel charge was $133 more than the actual charge on my bill - some package deal.

What does this have to do with word-of-mouth marketing? Well, I know that there are many people who routinely search the blogosphere for what customers are saying about their experiences with corporations. There are also companies, such as Satmetrix, Biz360 and others (help me out here Deb), that do this for a living. So, I'm sending this little anecdote off into the blogosphere in the fond hope that it will show up as a black mark on Expedia's record. In Net Promoter terms, I am now a detractor. Sorry Expedia.

But I'm fair. If they can resolve this problem quickly and to my satisfaction then I will post that result too and perhaps consider them again when I travel on business. I'm not going to call them, but I've given them my cell phone number so they can call me. I usually answer it right away and there is no annoying phone triage for my callers so they will get to a real person immediately.

What does this have to do with cloud computing? What if I could boil the Internet and distill out for you what people are saying about your company and your products?