Archives for the month of: March, 2009

 

When I began experimenting with machine learning as an undergrad in 1995, I was immediately taken with the possibilities. Contemplating biological evolution, it’s hard not to be awed that Darwin’s “endless forms most beautiful and most wonderful” have evolved from a blind evolutionary process, without invention from a conscious designer. Similarly, with machine learning, there is the enticing prospect that programmers won’t actually have to solve the hardest problems in engineering and artificial intelligence. Rather, with an appropriately advanced theory of machine learning (and machine evolution), we’ll evolve and train machines to solve these difficult problems on their own. On their own! The very thought is powerful, and magical, and maybe a little heretical too.

Despite my continuing enthusiasm for machine learning, I hear from skeptics all the time. Some are my friends. They are understandably burned out by overzealous claims about machine intelligence (e.g., the first chess-playing computers, Doug Lenat’s Cyc Project, the mid-80s literature on neural networks, the early-90s literature on artificial life, Ray Kurzweil’s books, and so on). Time after time, we hear about some new AI research program that promises to introduce intelligent machines to the world, only to have it fizzle out or hit a plateau after a few years.

Well, I’m sorry to say, the age of truly intelligent machines is not near. We are still many decades (if not many centuries) away from producing machines that can hold a good conversation, contemplate their own existence, or watch Stephen Colbert and have a genuine laugh. Nope, no conscious robots anytime soon. Humans are still much smarter than machines.

Still, I’m as enthusiastic as ever about machine learning because we are at the point where machines can learn autonomously to solve many large, interesting problems. Machine learning thrives on big data sets and lots of CPU cycles to crunch them. Thanks to the internet and cloud computing we now have more of both than ever. If you’re an AI skeptic, then I encourage you to stop worrying about conscious robots, and start thinking about the more immediate, practical applications of machine learning. There are many.

With machine learning, the programmer becomes a meta-programmer; instead of finding and coding the solution to a problem, she (1) gathers an appropriate data set, and (2) codes an appropriate learning algorithm that (if all goes well) will learn to solve the problem by examining this data set. For many programming challenges, this is more feasible and cost-effective than finding and coding the solution directly.
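To make the two steps concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the toy data points, the labels, and the function names `train_centroids` and `predict` are made up for this example, and the nearest-centroid rule is just one of the simplest learning algorithms available, not a recommendation for any particular problem.

```python
from collections import defaultdict


def train_centroids(examples):
    """Learn one centroid (mean feature vector) per label from (features, label) pairs."""
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [total / counts[label] for total in totals]
        for label, totals in sums.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Step 1: gather a data set (hypothetical toy points, invented for illustration).
training_data = [
    ([1.0, 1.2], "small"),
    ([0.8, 1.0], "small"),
    ([4.0, 4.2], "large"),
    ([4.4, 3.8], "large"),
]

# Step 2: run a learning algorithm over that data set.
model = train_centroids(training_data)

# The learned model, not hand-written rules, now labels unseen inputs.
print(predict(model, [1.1, 0.9]))
print(predict(model, [5.0, 4.0]))
```

Notice that nothing in the code says what makes a point "small" or "large"; that knowledge comes entirely from the examples, which is why the quality of the data set matters so much.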

Being a meta-programmer in this sense requires different skills than traditional programming. You need an understanding of the practical requirements and limitations of various learning algorithms (though not necessarily how to implement them, since open source implementations of many algorithms are available). You also have to be crafty about gathering training data. Can you scrape data from the web? Can you assemble a training set using Mechanical Turk or other human computation services? If you run a web service, can you collect the data from your users? Is a free or licensable data set already available? To a large extent, meta-programming through machine learning is about clever data acquisition: gathering the richest data set you can with as little time and money as possible.


I don’t know how, but at some point in the last two years I became obsessed with the problem of summarizing opinion information on the web: there are just too many opinions for our limited minds to absorb, many of them valuable, and we need help making sense of them all. For me, the solution is a summarization tool: something that can scan through pages of opinions and report the overall gist or consensus in just a few lines (or perhaps with a pithy visualization).

Why do I care so much about opinion summarization? Because I believe strongly in “wisdom of the crowd” effects (that we can often get better solutions to problems by synthesizing information from individuals), and yet I believe that we sometimes lack good mechanisms for information synthesis. Wikis have proven to be an excellent way to synthesize factual information. Numerical averaging works reasonably well to synthesize opinions about numerical values. Voting and surveys help us synthesize opinions about decisions when there is a small, discrete number of alternatives. And there are other solutions too, such as prediction markets, Digg-style voting, etc.
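Two of the mechanisms mentioned above, numerical averaging and voting, are simple enough to sketch in a few lines of Python. The function names and the sample guesses and ballots below are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter
from statistics import mean


def synthesize_numeric(estimates):
    """Combine many numerical opinions into one by averaging them."""
    return mean(estimates)


def synthesize_vote(ballots):
    """Combine discrete choices: return the most common option and its share of ballots."""
    tally = Counter(ballots)
    winner, votes = tally.most_common(1)[0]
    return winner, votes / len(ballots)


# Hypothetical inputs, invented for illustration.
guesses = [1150, 1230, 1190, 1210]       # e.g. individual estimates of some quantity
ballots = ["A", "B", "A", "C", "A"]      # e.g. votes among three alternatives

print(synthesize_numeric(guesses))
print(synthesize_vote(ballots))
```

Both functions reduce a pile of individual inputs to a single answer, but only because the inputs are already numbers or picks from a fixed list. Free-form written opinions offer no such shortcut.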

Yet I believe we are missing good mechanisms for synthesizing qualitative opinions. By that, I mean opinions about people’s feelings, attitudes, likes, dislikes, desires, reasons, etc. What do people think about Obama? Well, there are generally two ways to find out. First, if you check the presidential approval polls you’ll see that nearly 70% of Americans are currently confident in Obama as a president. That’s definitely a synthesis of opinions, but it leaves out all the details: it says nothing about why Americans like Obama, what they like or dislike about him, and more importantly whether you should approve of Obama or not. The second major way to find out what people think is to listen to or read opinions from a variety of sources (op-ed articles, Twitter, blogs, your friends, pundits on TV, etc.). There are a ton of opinions on Obama out there, roughly 10 created every minute on Twitter alone, but it would take an inordinate amount of time to read them.

Wouldn’t it be nice if you could click a button and immediately get a summary of all those opinions? For example, click a button and have your computer report that the biggest single topic of discussion is Obama’s economic plan, and that though most Americans are supportive, many of them are apprehensive about the massive federal debt that will be created. (Ok, I admit that’s a fictional example; those are just my opinions.) If you had this button, you could take a stack of opinions on any topic and immediately get the gist of what people have said, the points on which they agreed, and the points on which they disagreed.
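As a rough illustration of just the first part of that wish, finding the biggest topic of discussion, here is a deliberately naive Python baseline that counts how many opinions mention each term. It is not the approach used by any tool discussed here: the stopword list, the function name `top_topics`, and the sample opinions are all invented for this sketch, and simple term counting says nothing about whether people agree or disagree.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list (an assumption; real lists are much longer).
STOPWORDS = {
    "the", "a", "an", "is", "are", "and", "but", "of", "to", "i", "it",
    "his", "about", "will", "in", "on", "that", "this", "me",
}


def top_topics(opinions, k=3):
    """Return the k terms that appear in the most opinions, with their counts."""
    document_frequency = Counter()
    for text in opinions:
        # Use a set so each opinion counts a term at most once.
        words = set(re.findall(r"[a-z']+", text.lower()))
        document_frequency.update(words - STOPWORDS)
    return document_frequency.most_common(k)


# Hypothetical opinions, made up for illustration.
opinions = [
    "The economic plan is bold",
    "I worry about the debt the economic plan will create",
    "His economic plan worries me, but the speech was good",
]

for term, count in top_topics(opinions, k=2):
    print(term, count)
```

Even on this toy input the limits show quickly: the counter can tell that the "economic plan" comes up in every opinion, but not that "worry" and "worries" express the same concern, let alone what the overall consensus is.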

Last year, I launched Pluribo with my friend Samidh Chakrabarti as a modest initial solution to the problem. Pluribo is an NLP-driven tool to summarize user reviews for certain product categories. We began with electronics reviews and it worked pretty well. We then tried to expand to other categories and we started to run into problems. Yes, our solution worked for other categories, but each category required so much calibration and training that it turned out to be inefficient to cover all categories, one by one. A hotel version of Pluribo is in the works, but for the most part I have become convinced that a fundamentally different solution is necessary. Rather than beginning with an algorithm for one category and then trying to extend the approach to other categories, I’m now interested in algorithms that are domain-neutral from the start.

What I want is a universal tool for opinion synthesis, a tool that can accept multiple written opinions in any format, on any topic, and automatically provide a coherent summary. I do think this is possible. I am convinced the NLP technology we developed at Pluribo is not the right foundation for this, and that a fundamentally different approach is needed. Perhaps the new solution doesn’t use much NLP at all. Perhaps it mainly uses human computation to effectively decompose the tasks of aggregating, analyzing and synthesizing opinions; or perhaps there is a relatively simple machine learning solution that merely requires massive amounts of training data. Of course, what we consider to be a solution depends a lot on what we consider an acceptable summary to be.

I’m working on a human computation approach to qualitative summarization now. The approach is expensive to run, but it is domain-neutral and could make sense for certain applications such as helping government agencies listen to their constituents, or helping brands listen to their customers. I’m curious to see where it will lead.
