Friday, September 17, 2010

Mahout committers Ted Dunning, Grant Ingersoll and I met with some of our Mahout user friends over dinner at Panera's in Millbrae last night. The study of Machine Learning for me has always been a sequence of little mysteries to solve and this evening proved to be no exception. Ted kicked off the conversation with a provocative statement that ML is really about different ways to extract [meaningful] models from large volumes of data and that classification, clustering, SVD (singular value decomposition) and recommendation are all really just different ways to skin the same cat. It seemed preposterous at first. He drew a box with lots of arrows going in on the left and just a few arrows coming out on the right to illustrate how each of these processes consumes volumes of data and produces a much smaller and more concise model of it. He went on to say that each of these techniques is better than its brethren at extracting certain kinds of meaning and that real-world data will often require more than one of these techniques to be chained together to gain accurate insight (more meaningful models).

We've been having some discussions on the mailing list recently about how to unify our clustering and classification data structures in order to make them more "plug and play". I had done some refactoring of the clustering data structures in order to eliminate a lot of redundant code and unify their behaviors. Ted had introduced an AbstractVectorClassifier a couple of months ago as a way of unifying all the classification algorithms and was looking at one of its new subclasses, the VectorModelClassifier, in the clustering package. Where had it come from? After reviewing the code I recalled it as an experiment I'd done to see if I could integrate our new clustering models into the classification framework. I had not intended to commit it at the time and so I didn't recognize it at first, but there it was: a classifier that could classify vectors based upon the model output of any of our clustering jobs. The beginnings of integration were at hand.

All of our clustering jobs can perform a final job step which assigns each input vector to one or more of the models which the clustering has produced. Said differently, they can all classify each input vector to one or more of the models. And when I think about the cluster-creation steps that our clustering algorithms all perform as training, the unification becomes even clearer. Of course, Ted pointed out, clustering is really just unsupervised classification and classification is really just supervised clustering. I think I'm starting to get it! Both consume large volumes of raw data and produce, either supervised or not, a smaller set of models that characterize the data: its meaning.
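To make the symmetry concrete, here is a minimal sketch in plain Python. It is not Mahout code and not the VectorModelClassifier; the function names (`train_centroids`, `classify`, `kmeans`) and the choice of centroids with Euclidean distance are assumptions made purely for illustration. The point is that the same `classify` step works whether the models came from labeled data or from a clustering pass.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def train_centroids(points, labels):
    """Average the points that share a label into one model (a centroid) per label."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, value in enumerate(point):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(vector, models):
    """Assign a vector to the closest model, however the models were produced."""
    return min(models, key=lambda label: distance(vector, models[label]))

def kmeans(points, initial, iterations=10):
    """Unsupervised 'training': the labels come from the current models themselves."""
    models = {i: list(center) for i, center in enumerate(initial)}
    for _ in range(iterations):
        labels = [classify(p, models) for p in points]
        # update() keeps the previous centroid for any cluster that ended up empty
        models.update(train_centroids(points, labels))
    return models

points = [(0, 0), (0, 1), (1, 0), (9, 9), (10, 9), (9, 10)]

# Supervised: a person supplies the labels.
supervised = train_centroids(points, ["a", "a", "a", "b", "b", "b"])

# Unsupervised: k-means invents the labels (0 and 1) on its own.
unsupervised = kmeans(points, initial=[(0, 0), (10, 10)])

print(classify((8, 8), supervised), classify((8, 8), unsupervised))
```

Only the source of the labels differs between the two paths; the final assignment step is shared.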

So what about SVD? Our SVD implementation uses Lanczos' algorithm to produce a set of eigenvectors and their associated eigenvalues from an input matrix. The eigenvectors and eigenvalues are typically much smaller than the original data and may be used in place of it for many computations. Hey, they're models too! The clustering of text documents, for example, typically involves a very high-dimensional, sparse term vector for each document in a corpus. If one tries to cluster these raw vectors one often confronts "the curse of dimensionality" and the clustering does not produce useful results. If, instead, one uses SVD to first reduce the dimensionality of the term vectors and then clusters that data, the results are often considerably improved. To summarize, SVD is a process which extracts a [meaningful] set of models (the eigenvectors and eigenvalues) from the data. Because it is unsupervised, might one think of it as a form of clustering? IDK. At least it is one of the Mahout services that can be chained together with clustering to produce more insightful results.
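As a rough illustration of "reduce first, then cluster", the sketch below finds the leading eigenvectors of A^T A for a tiny document-by-term matrix A and projects each document onto them. Those eigenvectors are the right singular vectors of A, and the eigenvalues are the squared singular values. Note the substitution: this uses plain power iteration with deflation, which is much simpler than Lanczos and is not what Mahout does. All function names and the toy matrix are assumptions for the example.

```python
import math

def matvec(m, v):
    """Multiply a row-major matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def gram(rows):
    """Return A^T A (terms x terms) for a row-major document-by-term matrix A."""
    n = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]

def top_eigenpair(m, iterations=200):
    """Power iteration for the dominant eigenpair of a symmetric PSD matrix.

    The fixed start vector (1, 2, 3, ...) is an arbitrary choice; it would fail
    only if it happened to be orthogonal to the dominant eigenvector.
    """
    v = [float(i + 1) for i in range(len(m))]
    for _ in range(iterations):
        w = matvec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(m, v)))
    return eigenvalue, v

def top_k_eigenpairs(m, k):
    """Repeat power iteration k times, deflating the matrix after each pair."""
    m = [row[:] for row in m]
    pairs = []
    for _ in range(k):
        lam, v = top_eigenpair(m)
        pairs.append((lam, v))
        for i in range(len(m)):
            for j in range(len(m)):
                m[i][j] -= lam * v[i] * v[j]  # subtract lam * v v^T
    return pairs

def project(rows, pairs):
    """Map each document onto the retained eigenvectors (one coordinate each)."""
    return [[sum(a * b for a, b in zip(r, v)) for _, v in pairs] for r in rows]

# Four toy "documents" over four "terms": two about one topic, two about another.
docs = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 3], [0, 0, 3, 1]]
pairs = top_k_eigenpairs(gram(docs), k=2)
reduced = project(docs, pairs)  # each document is now 2-dimensional
print(reduced)
```

The reduced vectors could then be handed to any clustering routine in place of the original term vectors.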

Matrices are also used a lot by our recommender services to recommend items to users based upon some metrics of user preference for each item. These co-occurrence matrices are generally large and unwieldy. In user-based recommending, the goal is to recommend items to users based upon what items similar users found most interesting, and the co-occurrence matrix has size equal to the number of users squared; often a huge matrix. In item-based recommending, the goal is to recommend based upon which items are similar to each other, and the co-occurrence matrix has size equal to the number of items squared; usually smaller but still quite large. SVD can be used in both cases to reduce the dimensionality of the co-occurrence matrices. And so too can clustering services be used within a recommender engine to codify the similarity metrics used to make the recommendations. These services really do need to plug and play together.
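For a feel of what an item-based co-occurrence matrix looks like, here is a small self-contained sketch. It is not the Mahout recommender API; the `cooccurrence` and `recommend` functions, the scoring rule (sum of co-occurrence counts) and the sample data are all assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(baskets):
    """Count, for each pair of items, how many users have both of them."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in baskets.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1  # the matrix is symmetric: items x items
    return counts

def recommend(user, baskets, counts, top_n=3):
    """Score unseen items by how often they co-occur with the user's items."""
    owned = set(baskets[user])
    scores = defaultdict(int)
    for item in owned:
        for other, n in counts[item].items():
            if other not in owned:
                scores[other] += n
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]

baskets = {
    "ann": ["bread", "milk", "eggs"],
    "bob": ["bread", "milk"],
    "cat": ["milk", "eggs", "jam"],
    "dan": ["bread"],
}
counts = cooccurrence(baskets)
print(recommend("dan", baskets, counts))
```

With four distinct items the matrix has at most 4 x 4 entries; with millions of items it grows quadratically, which is where dimensionality reduction becomes attractive.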

Ok, I'm having a bit of an epiphany here and this may not all be spot on. But the proposition that the parts of Mahout which I've always viewed as being unrelated are actually interdependent is starting to grow on me. It's kind of a grand unification theory which may well lead to further integration and other improvements in the Mahout service portfolio as it plays out. A few mysteries got solved last night and a few more got added to the list. An evening well spent.

Saturday, May 9, 2009

A Pair of Cloud-related Talks by Me

I've been procrastinating but it's time to post an update. I've given a couple of talks recently and here are the titles, brief summaries and a little link love:
  • BI Over Petabytes: Meet Apache Mahout - I introduced the Mahout project to the SDForum Business Intelligence SIG meeting last month at SAP in Palo Alto. The talk was quite well attended and there was standing room only. After a brief overview of the project in general, I showed a comparison of the various Mahout clustering algorithms on a hypothetical astronomical dataset. Paul O'Rorke, one of the forum chairs, posted a nice blog entry on the talk here. You can get a copy of the slides from here.
  • Net Promoter in the Cloud: An Experiment on the Platform - I had a nice opportunity to fly to Sydney to give this talk at the JAOO conference there. In it I described my experiences building the application that is discussed more in the postings below. The conference was organized into three concurrent tracks so the attendees had to choose between my talk and two others. The talk which won hands down was Patrick Linsky's "How to build an iPhone application in 45 minutes" which, I must admit, I wanted to hear too.
  • was a sponsor of the show and I did get a chance to meet Clayton Brown, their local SE guru, who also gave an amazing example of building an enterprise application in 45 minutes. Not quite as much sizzle as an iPhone app, but IMHO a far more challenging problem.
  • Here's another link to Paul O'Rorke's blog where he describes another SDForum talk by Salesforce CTO Craig Weissman titled "The Data Architecture of".
  • One interesting coincidence: Attendees were asked to rate each talk on their way out the door by dropping red, green or yellow slips of paper in a voting container. While not precisely Net Promoter procedure, they calculated a similar score by subtracting the red percentage from the green one. By this metric I had a +66% which left me feeling worthwhile in the end :).
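The slip-based score described above is simple arithmetic: the percentage of green slips minus the percentage of red ones. A tiny sketch follows, where the function name and the slip counts are hypothetical (the actual counts from the talk are not recorded here).

```python
def slip_score(green, yellow, red):
    """Return the green percentage minus the red percentage of all slips cast."""
    total = green + yellow + red
    if total == 0:
        raise ValueError("no slips were cast")
    return 100.0 * (green - red) / total

# Hypothetical counts chosen only to show how a +66 could arise.
print(slip_score(green=70, yellow=26, red=4))
```

Yellow slips count toward the total but toward neither side, much as neutrals do in a Net Promoter calculation.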
Right now I'm in Brisbane and will give another talk to the JAOO Conference here on Tuesday. The scheduling is different: This time I will be able to see Patrick's talk and, hopefully, get a larger share of the turnout too.

Tuesday, March 10, 2009

The Power of Naked Conversations

It's kind of exciting when you get a concrete indication that somebody - a real person - has actually read your blog! Exciting and maybe a little scary too, because you never really know what goes on "out there" in the blogosphere.

So, consider how excited I was when I got two (2!) independent responses from people to my last posting: one comment to the post directly by Jon Mountjoy, the Developer Force Community Manager; and one from Jesse Lorenz, a Technical Evangelist. When I said "I think employees could do a better job of monitoring their discussion boards", I had no idea they would find out about my discussion posting from my blog! And here I thought I was merely ranting into the ether; the great big dummy load in the Internet Sky that absorbs all inputs and returns nothing but the warm feeling in the pit of my stomach when I write.

I had another "Twilight Zone" experience after my Road Warrior Stories posting last summer. Nothing immediate happened at first, but the next time I happened to fly on Continental Air Lines I was mysteriously upgraded to first class! I had no miles, no status, but mysteriously I was in the front cabin. Go figure.

It's almost like some companies proactively search the blogosphere, looking for user stories - unhappy ones - where they can intervene to turn a potential detractor into a promoter. It is really good business: a detractor is twice as likely to kill a sale as a neutral, and promoters give you an extra 50% boost. It usually does not require moving mountains to resolve their issue either: a couple of pointers into your documentation stack; a short email message; or a free upgrade. Small warm fuzzies from large, impersonal organizations have a huge impact.

So, despite my poor experience in Cleveland last summer, I'm no longer a Continental detractor. And those two responses from Salesforce guys actually got me cooking again on my NetPromoter experiment. I'm over the hump and rockin my way to a cool little application in the universe. And, for some strange reason, there are more responses to the postings on their developer boards now than there were before. Kudos guys, may the Force be with you!

Thursday, March 5, 2009

Experiment

I've spent most of the last month continuing to explore the developer platform with an experiment to implement some NetPromoter capabilities. Building and updating my business object model was very straightforward using their web-based developer platform and Eclipse plugin. Within a week I was able to implement some simple business processes using their workflow engine that allowed me to notify my employees about detractor events, create tasks for them to do, plan and approve mitigating actions. I also found their Visualforce web platform to be quite easy to use and their Apex Java-like scripting language to be powerful and succinct.

It is, however, a huge system and learning its subtleties was rather slower than getting "Hello World" working. I started with the workbook tutorials that I got at the Cloud Connect conference. As far as it goes, the tutorial really got me off to a good start. When I got out of its wading pool, however, I encountered the volumes of help documents in their online help facility and frustration began to set in. The CRM platform has so much capability I had to spend a lot of time understanding it before I could make more progress on my own experiment. The help documents only describe the simplest of examples, leaving a lot of my questions unanswered.

I turned to their discussion boards and posted a few questions to their developer community. Perhaps I am still too much of a noob to be bothered with, or my questions did not make sense, but their community did not respond like the one I've experienced with Apache Hadoop for example. A majority of the questions posed by developers just go unanswered. I think employees could do a better job of monitoring their discussion boards so that developers in my state of learning can get across the gap between their nice toy tutorials and developing a real system.

While I'm still a little frustrated, I have not given up. From a cloud computing perspective, the platform is at the highest tier of the Infrastructure-as-service, Platform-as-service and Application-as-service pyramid. This means there is maximum functionality to leverage but also maximum vendor lock-in to use their application. Apex, while Java-like, is not Java and porting my application to another platform (e.g. open source) does not look feasible. Visualforce, a taglib-style web toolkit, looks like lock-in too. This is a big bullet to bite.

Their CRM offering, however, is well accepted world-wide and the ability for me to develop an application that can leverage their 55k+ customers' CRM artifacts and workflows is very attractive. Once I get my application working in their environment, it is fully scalable, localizable and web-service enabled. This means I can concentrate almost completely upon the features of my own product and leave all the infrastructure headaches to them. They even have an AppExchange to help me market and distribute my application. I gotta learn more about this stuff.

Friday, January 23, 2009

Cloud Connect Conference - Thursday

I wanted to demonstrate my application running on a real hadoop cluster on EC2, so I woke up early on Thursday to bring up a 3-node cluster using the excellent deployment scripts provided by the hadoop-18 distribution.

At the conference I was preoccupied in Java jar file hell trying to build a deployable version of my demo and did not pay good attention to the speakers. By noon I had finally gotten over that roadblock and had a jar file that would run the entire application on hadoop. After I showed David my application, he challenged me to integrate it with the Google Maps API, so I also missed most of the unconference sessions that preceded the demo session attempting that. I was able to get one zip code to show in a browser on a map but a more complete solution eluded me. And so it goes with me, often getting sucked into building things when I should be listening to and interacting with others.

At the demonstration session I gave a brief talk titled "Using Hadoop to invert data - or - How to drive a thumbtack with a pile driver". The program used Axis to extract some account data tuples from the demonstration site. It then used 48 mappers and a single reducer to invert these tuples using much the same map/reduce algorithm as Google and Yahoo! use to invert the Internet for page rank data. My demo worked, was well received and I won a nice iTouch for my labors. I thought the conference was useful and informative and I made a couple new friends in the process. I'd recommend it to others with a Net Promoter Score of 9.
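The inversion itself can be sketched in a few lines of plain Python that imitate the map, shuffle and reduce phases in memory. This is not the demo's Hadoop code: the function names, the sample records and the assumption that each tuple is an (account name, zip code) pair, as described in the Wednesday entry, are all illustrative.

```python
from collections import defaultdict

def map_phase(record):
    """Mapper: swap the tuple so the zip code becomes the key."""
    account_name, zip_code = record
    yield zip_code, account_name

def shuffle(pairs):
    """Group mapper output by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(zip_code, account_names):
    """Reducer: emit one record per zip code with all of its account names."""
    return zip_code, sorted(account_names)

def invert(records):
    """Run the three phases in sequence over an in-memory list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical sample data.
records = [("Acme", "94030"), ("Globex", "94043"), ("Initech", "94030")]
print(invert(records))
```

On a real cluster the mappers run in parallel over splits of the input and the framework performs the shuffle, but the shape of the computation is the same.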

Cloud Connect Conference - Wednesday

The Wednesday session began with opening remarks by David Berlind followed by a panel discussion moderated by Stephen O'Grady of RedMonk with panelists: Sam Charrington of Appistry, Alistair Croll of Bitcurrent and Bob Sutor of IBM.
  • ASPs -> SaaS -> cloud computing evolution has been around for over ten years now
  • PaaS is a more recent addition that offers the most open platform for hosting custom and proprietary applications
  • Standards, interoperability, portability and collaboration offer ways to avoid vendor lock-in
  • Companies should experiment with internal and external cloud technologies to gain perspective
  • Challenges in administration, governance, control and ownership of derivative works remain
Some questions from the audience:
  • Acquisitions create myriad application integration issues, how does the cloud help? Coexistence, interoperation and migration offer a range of approaches that are really independent of the cloud. The cloud offers the ability to mashup applications that were not possible before.
  • Larry Ellison and Richard Stallman have been vocal critics of cloud computing. What's their beef? Some vendors thrive on lock-in and others advocate viral open software. The cloud is already here, it is thriving and it will assimilate everything.
  • Where is the cloud in terms of crossing the chasm? Email and web hosting are already on the other side, with SaaS vendors hot on their heels. Companies are cautiously entering the market but most are still on the early adopter side. Multiple layers of services from bare boxes to enterprise solutions offer many ways for companies to cross as they can benefit from the cloud's economies of scale.
The panel was followed by nine brief technology "Solution Provider Speed Geeking" pitches and demonstrations that were given in the exhibit hall. We formed up in small groups and rotated between presentations on the various vendor products to the sound of Dave's loudspeaker siren. These were then followed by more in-depth sessions by the vendors after lunch. I attended the following sessions:
  • Google App Engine - takes care of automatically scaling my web applications written on top of their Python deployment framework. They support all the tools needed to build new dynamic applications involving search, maps, earth, blogs and visualization.
  • Platform - an extended Java application framework that integrates with the CRM artifacts. It has a great set of developer tools and rich new applications can be constructed and deployed easily.
  • Amazon EC2 - has released a new administration console that is a huge improvement over its predecessors.
  • Amazon Mechanical Turk - has a huge pool of "artificial artificial intelligence" workers that can be put to work on a fee-for-task basis, doing simple to complicated tasks for a sliding compensation scale from pennies to hundreds of dollars.
  • Google APIs - offer JavaScript libraries for integrating their server side applications in your web applications. Simple yet powerful to use.
Dave threw down the gauntlet to developers by offering some prizes to volunteers who would use some of these technologies to build a demonstration application for the following day. I volunteered and spent some time with a guy from exploring their package to use web services to access some account data to munch with hadoop on EC2.

It took only a few minutes to customize their quickstart application to obtain and invert some account_name and zip_code tuples in memory. I left the conference and by 11pm had a working Hadoop application that would perform the same inversion on terabytes of similar data using a supercomputer. Ironically, both programs were almost the same size!

Cloud Connect Conference - Tuesday

I just got back from the Cloud Connect Conference at the Computer History Museum in Mountain View. The conference was partly an unconference that was sponsored by Google, Amazon, Salesforce and others. David Berlind ran an energetic show that was product and technology focused and very hands-on.

The first session on Tuesday evening brought three short customer "elevator pitch" presentations from Peter Coffee of, Adam Selipsky of Amazon Web Services and Rajen Sheth of Google to a group of four IT executives: Tim Crawford from Stanford University, Carolyn Lawson of California PUC, Ronald Smith of Cadence Design Systems and Robert Loolley of Utah Technical Services.

The three vendors pitched different cloud computing products but there was a fair amount of overlap in many of their messages: "The benefits of cloud computing are clear, so why delay?"
  • Adam presented the AWS platform-as-service offerings that he equated to the development of the electric power grid in the US. "We make electricity so you don't have to." I have a little experience with EC2 and S3 and would recommend them. I've been running a web server on it for some months and a 5-node Hadoop cloud more recently.
  • Rajen presented their which consist of a collection of client-side JavaScript libraries that work in concert with server-side Python services. I don't do either language very well but got some hands-on experience later in the program. This would appeal to developers building calendar, map, search and earth related web applications.
  • Peter talked about desktops burdened with too much state and IT departments benefitting from improved productivity, scalability and governance provided by the platform. It consists of a set of developer tools and web services that open up the innards of the CRM to facilitate integration of custom business applications. It is written in a Java dialect with SQL integration that really makes it easy to construct new applications.
The four potential customers asked a number of questions on the following that were fielded by the presenters:
  • Interactive Applications - Lag is a big impediment to hosting truly interactive applications remotely in the cloud
  • Migration into the Cloud - Custom applications often must be rewritten to move into cloud deployment. Email and public website hosting were offered as no-brainer cloud services already in full production. Customers can leverage the innovation scale of cloud providers to gain business advantage.
  • Migration between Cloud vendors - Vendor lock-in is an issue since some of the platforms rely upon proprietary languages and all proprietary software frameworks discourage migration. Open source and standards were offered as mitigating lock-in but premature standards only help the established early providers.
  • Security - A general uneasiness with allowing private data to be hosted in the cloud was expressed. Vendors responded that their large investments in state-of-the-art security lent economies of scale in the quest for data security.
  • Privacy - Once private data is cloud hosted it needs strict access controls to ensure its integrity. Vendors pointed out that lots of corporate data is lost every year to laptop theft and loss of USB keys and that the cloud offers better governance.
  • Legal Uncertainties - The cloud is so new that many legal issues about data ownership and rights to disclosure are untested in the courts.