What it takes to build great machine learning products (oreilly.com)
212 points by yarapavan on April 16, 2012 | 47 comments



A great and insightful article. A common theme I've seen in practice is that folks with a deep understanding of ML often run straight to applying the most sophisticated algorithms possible to raw data. On the other hand, people who know a bit about ML but understand the domain better start by applying intuition to data cleansing and then follow up with simpler algorithms. Without fail, the latter group ends up with better results.


My definition of a "deep understanding of ML" definitely excludes people who immediately try "the most sophisticated algorithm." Buzzword jockeys try all the cool stuff first, but most practitioners I have met try basic statistical methods first. Then, when they see what issue they need to overcome, they bring in a method designed to help with that issue.


Really? I'd think that most ML people are well aware of the importance of data cleansing and feature extraction. Also my experience is that domain knowledge often (but not always - depends on the domain) helps surprisingly little. Feature extraction is mostly an iterative approach anyway: you define some very simple features, you look at the mistakes, you add some features and repeat until you are happy. Ideally you also do some visualization in there somewhere.


I think there are essentially two "deep" understandings of ML prevalent today. The first is more common: the ability to do the calculus, algebra, and probability derivations required to design complex ML algorithms combined with the CS knowledge to find/design a good algorithm and the software design skill to actually implement it on real, "big" data.

No doubt this is a difficult position to master, and those who perform well are able to tackle lots of mathematical and computational challenges. They are also model builders who have a tendency to relentlessly seek complex models in order to solve complex problems.

The other, rarer side is the learning theorist who may or may not understand the model-building, algorithmic, and computational tools but understands well the theories which allow us to have reasonable expectations that the tools of the first group will work at all. These guys have a funny history in that they were the old statisticians who got major egg on their faces after proclaiming that essentially all of ML was impossible. Turns out the first group managed to redefine the problem slightly and make major headway (and money).

---

The thing I want to bring to light, however, is that the second group knows the math that bounds the capacities of ML algorithms. This isn't easy. It's one thing to say you recognize that the curse of dimensionality exists, but it's another to have felt its mathematical curves and to have built an intuition for what forces are sufficient to cause disruption.

The more experience you have with the learning maths, the more likely you are, I feel, to apply very simple algorithms, to be scared of "little x's" (real data) enough to treat them with great care, and to explore the problem space with a clear sense of which steps will lead you to folly.

---

It's a fine line between the two, though. Stray too far to the first group and you'll spend a month building an algorithm that does a millionth of a percentage point better than Fisher's LDA. Spend too much time in the second camp and you'll confidently state that no algorithm exists that does better than a millionth of a percentage point over Fisher LDA... and then lose purely by never trying.


Our Data Mining professor (a different but somewhat related field) gave us this quote on the first day: "All models are wrong, but some models are useful."

You can build an extremely complicated model that is not useful, where a simpler one might suffice.


I've heard that line referred to as Box's Razor. It's definitely the right heuristic, but it's interesting to see that even if your model is right, you're still in trouble if it's too complex. This is a sort of bias/variance tradeoff.
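To make that concrete, here is a toy numpy sketch (my own made-up example, not from the article): the true function, a cubic, lies in both polynomial families below, yet the higher-degree family typically generalizes worse on a small noisy sample because of its variance.

    import numpy as np

    rng = np.random.default_rng(0)
    # True function is a cubic; both fits below contain it, but the
    # higher-degree fit has far more variance on 20 noisy points.
    x_train = np.sort(rng.uniform(-1, 1, 20))
    y_train = x_train**3 + rng.normal(0, 0.1, 20)
    x_test = np.linspace(-1, 1, 200)
    y_test = x_test**3

    for degree in (3, 12):
        fit = np.polynomial.Polynomial.fit(x_train, y_train, degree)
        mse = np.mean((fit(x_test) - y_test) ** 2)
        print(f"degree {degree:2d}: test MSE {mse:.4f}")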


> On the other hand, people who know a bit about ML but understand the domain better start by applying intuition to data cleansing and then follow up with simpler algorithms.

I find data cleansing (if you are including feature selection) hard, and I consider it a refinement. If I am working on a classification problem, I start with naive Bayes and a trivial feature generator (if words are the features: split on whitespace and discard some symbols), train it, and cross-validate. Depending on the results of the cross-validation on differently sized data sets (say 100 tweets, 200, 500, 1000, 2000, 5000), I decide whether to refine Bayes further or pick another algorithm.
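Roughly the kind of thing I mean, as a sketch with scikit-learn (assuming a list of raw texts and their labels; the size steps just mirror the ones above):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def score_by_size(texts, labels, sizes=(100, 200, 500, 1000, 2000, 5000)):
        # Trivial feature generator: split on whitespace, count the tokens.
        clf = make_pipeline(CountVectorizer(token_pattern=r"\S+"), MultinomialNB())
        for n in sizes:
            if n > len(texts):
                break
            scores = cross_val_score(clf, texts[:n], labels[:n], cv=5)
            print(f"{n} examples: accuracy {scores.mean():.3f} +/- {scores.std():.3f}")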

I avoid SVMs because I have a hard time figuring out the kernel and the relationships in the data. I mostly don't use linear classifiers because the relationship is very rarely linear.

Generally, if the features are pseudo-independent (naive Bayes assumes independent events, but it might work fine even if the events aren't independent), naive Bayes does the job. If not, it's time to refine the feature generator and selector.


Naive Bayes is a linear classifier, and it makes much stronger assumptions than other linear classifiers.


My bad.

Regarding stronger assumptions, is there anything other than independence (thus the name "naive") that it assumes?


That's the main assumption that people care about. In contrast, logistic regression (aka maximum entropy) does not make this assumption. As a go-to first classifier I would suggest multi-class logistic regression with regularization.
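A minimal scikit-learn sketch of that suggestion, using the built-in iris data as a stand-in for your own features and multi-class labels (C controls the strength of the L2 regularization; smaller C means stronger regularization):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    print(cross_val_score(clf, X, y, cv=5).mean())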


That has been my experience with AI programming.

But I would take a more pessimistic interpretation of this.

That is: all our "learning algorithms" have failed to learn, and those with some clever heuristics succeed versus the broken methods we have so far.


I would offer a somewhat more optimistic take: humans are still better learners than machines are, and our algorithms still do not capture well the way our thinking (and intuition) works. Really, we only started on machine learning a few decades ago; catching up with millions of years of evolution can be expected to take a bit longer.


But I don't really think of that as positive.

Maybe it's the pain of my previous AI job talking, but when the choice is just an opaque hunch of an expert, it doesn't feel like a victory for human intelligence. A victory for human intelligence looks more like the discovery of a physical law, where you both deal with a phenomenon and communicate how someone else can also deal with it.

What is quintessentially human in the modern sense is human beings understanding ourselves rather than


Agreed. Another ingredient is sustained engagement with the problem, so that your algorithm works not just for a pre-selected demo, but actually provides noticeable performance gains for real data.


I agree that the big wins in machine learning/(weak)AI are probably going to come more from figuring out how to better apply existing models and algorithms to real problems rather than from improving the performance of the algorithms themselves.

That said, one shouldn't underestimate the amount of commonality between problems that to some people may appear unrelated. For example, this post talks about the gains in machine translation performance from including larger contexts. The same principle applies to many other sequence learning problems. For example, you have a very similar issue with handwriting recognition, where it is often not possible (even for a human) to determine the correct letter classification for a given handwritten character without seeing it within the context of the word.


The article is light on details. IMO there are two major things your team needs:

1) Programmers that have the needed math skills, or mathematicians with the needed coding skills

2) A distributed ML framework

Solving problem one is not easy but it's straightforward.

Solving problem two is harder. While there are a lot of open source machine learning projects, almost all of them seem to be focused on being used by a person rather than by a program. Moreover, very few do distributed processing, except for Mahout (http://mahout.apache.org/). Mahout is promising, but the documentation is still thin, and I'm not sure if it's gaining momentum in terms of mind share yet.


What kind of math skills? What would a programmer need to learn in order to work on ML stuff?


Aside from the algebra to do logs, exponentials, division, addition, and multiplication, you have to be versed in statistics. Most ML problems are solved on a statistical basis. Although many of the algorithms have already been worked out, you still need to grasp the statistics behind them, which is a bit more involved than calculating the odds of a die roll.


Yes, many people that I know working on ML do not remember statistics well enough. Some keep their ignorance while relying on mathematicians on-staff (who can't really code), while others either start buying college textbooks or go to night classes. There are too many people who don't understand the algorithms being used.


To work with ML stuff you don't need much. You can just download packages and get experience with how best to pick models, choose features, and tweak (hyper)parameters. If you want to work on it or understand it, then you will need math.

You need a decent understanding of calculus (mid-1800s level, multivariate calculus), a more decent understanding of linear algebra (1950s), information theory (1960s), and probability and statistics, with the last having shifted the most from the past due to the more recent respect for Bayesian methods. Note that the years in parentheses are not to say that nothing new has been used from those areas; it's more that if you pick up a book on that topic from that year, you would be pretty well covered for the purposes of ML.

Also worth having a vague idea of are things like PAC learning, topology, and computational complexity, e.g. Valiant's work on evolvability. If you are doing stuff related to genetic programming, then category and type theory have riches to be plundered.

Or, if you want to be more hardcore and are looking at very high-dimensional data and reductions on it, you might look at algebraic geometry (in particular algebraic varieties) and group theory. So basically the answer to your question is: as little or as much math as you want, depending on the problem and your interest in trying different approaches from the typical toolkit of linear algebra and statistics.


*If you are doing stuff related to genetic programming then category and type theory have riches to be plundered.*

Could you expand this a bit as I don't understand the meaning. Are you saying that if the problem you are working on can be solved with genetic algorithms, then you could blow it away with category and type theory?

I don't have a vested interest in either, I am just curious. Thanks.


He said genetic programming not genetic algorithms. For the former, type and category theory are useful!


In case anyone clicks his link to Variational Methods and is confused to find an article on quantum mechanics, as inspirational and arguably related as it may be, I think he actually meant to link to: http://en.wikipedia.org/wiki/Variational_Bayesian_methods


Right now NLP is mostly limited to niche applications, e.g. sentiment analysis and clever products built around it. I actually think the reason is that both natural language processing and machine learning are still in their early days.

Imagine all the applications for consumer products if algorithms were really able to understand language (as far as you can understand something if you are a computer program and not a sentient human being), for example if we were able to do real text summarization.

I believe this is not only possible, but not as far away as people think. However, to reach that goal we need to let go of the idea that NLP is mostly about clever feature engineering and instead start building algorithms that derive those features themselves. Part of the problem is how evaluation is set up in NLP. Which algorithm is best is decided based on who gets the best performance on some dataset. This sounds all nice and objective, but you will always be able to get the best performance if you try enough combinations of features (overfitting the test set) [1]. These small improvements say little about real-world performance.

For the NLP people among you, this is an interesting paper that tries to do a lot of things differently: http://ronan.collobert.com/pub/matos/2008_nlp_icml.pdf

This is the corresponding tutorial, which is quite entertaining as well: http://videolectures.net/nips09_collobert_weston_dlnl/

[1] I think this is less true for machine translation, where there are more and bigger test sets and less feature engineering going on.


Careful with the Collobert ICML-2008 paper. It has a very negative reputation among NLP researchers who actually know the area, just for its setup/evaluation. If you're interested in the methods (which I think are interesting), that group's later work is much improved.


Thanks I will look into it.


Very nice article, Aria. You briefly mention Pegasos as a scalable alternative to SMO. I agree that this works well for linear models. But despite the claim that Pegasos can be trivially adapted to kernel models, I have never seen any implementation of a kernel Pegasos, and I don't understand how it's even possible. Have you used a Pegasos-style algorithm to fit non-linear models?

On the other hand, there exist alternatives such as LaSVM that can effectively scale linearly to large datasets (but the optimizer works in the dual representation, as with SMO, and not like Pegasos).


You may want to look at the paper: "P-packSVM: Parallel Primal grAdient desCent Kernel SVM" from ICDM 2009. It presents an extension of Pegasos to non-linear kernels. Evaluating the pairwise kernels <x_i,x_j> and continuously updating the estimate of the norm of the implicit weight vector w seem to be the main hurdles to achieving the performance gains seen with linear kernels.

The key takeaway from the paper (for me) was that the computation time on a single processor was not significantly better than that of the standard implementation provided by SVM-Light. However, with a variety of tricks permitted by the use of an SGD/Pegasos-like method, the authors were able to get significant speedup when using a compute cluster, allowing a good reduction in computation times (e.g. ~200x reduction on 512 processors).


For NLP applications, which I think Aria's article is mostly concerned with, non-linear kernelized classifiers are often little better than linear ones. I think that's one part of the recent interest in SGD-style training algorithms (they work for linear cases nicely, less so for kernelized ones).

[deleted part about kernelizing Pegasos; realized I don't know that area]


Link to LaSVM paper: jmlr.csail.mit.edu/papers/volume6/bordes05a/bordes05a.pdf Also a good overview of SVM techniques in general.


So...are you saying you need the dual formulation in order to allow a kernel model?


No, actually that's not the case. But I don't know how Pegasos can be adapted to use kernels. If you look at figure 1 of the paper [1], you will see that the gradient of the objective function is used to update a single weight vector `w` at each step of the projected stochastic gradient descent. In a kernel model, all the support vectors cannot be collapsed into a single weight vector `w`. You would need to handle the kernel expansion against the support vectors explicitly. But then how do you select the support vectors out of all the samples from the dataset while keeping the algorithm online? The Pegasos paper does not mention it.

[1] http://eprints.pascal-network.org/archive/00004062/01/Shalev...
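For reference, here is a rough numpy sketch of the linear/primal update from figure 1 as I read it (labels assumed to be in {-1, +1}); just an illustration, not the authors' code:

    import numpy as np

    def pegasos_linear(X, y, lam=0.1, epochs=10, seed=0):
        """Pegasos: projected stochastic sub-gradient descent on the primal SVM objective."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        t = 0
        for _ in range(epochs):
            for i in rng.permutation(n):
                t += 1
                eta = 1.0 / (lam * t)
                if y[i] * X[i].dot(w) < 1:               # margin violation
                    w = (1 - eta * lam) * w + eta * y[i] * X[i]
                else:
                    w = (1 - eta * lam) * w
                norm = np.linalg.norm(w)                 # optional projection onto a ball
                if norm > 1 / np.sqrt(lam):
                    w *= (1 / np.sqrt(lam)) / norm
        return w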


The set of support vectors is just the set of training examples that have non-zero alpha parameters. To implement the gradient update you just evaluate the support vector machine on the example (using the explicit kernel expansion) and then if the example has signed margin less than 1 you add y * eta to the corresponding alpha value.
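A minimal numpy sketch of that update, as I understand it (user-supplied `kernel` function, labels in {-1, +1}, one alpha per training example); illustrative rather than a tuned implementation:

    import numpy as np

    def pegasos_kernel(X, y, kernel, lam=0.1, epochs=10, seed=0):
        rng = np.random.default_rng(seed)
        n = len(X)
        alpha = np.zeros(n)
        # Precompute the Gram matrix for convenience; a truly online version
        # would evaluate kernels against the current support set only.
        K = np.array([[kernel(a, b) for b in X] for a in X])
        t = 0
        for _ in range(epochs):
            for i in rng.permutation(n):
                t += 1
                eta = 1.0 / (lam * t)
                margin = y[i] * alpha.dot(K[:, i])   # f(x_i) = sum_j alpha_j K(x_j, x_i)
                alpha *= (1 - eta * lam)             # alphas decay but never get clipped to zero
                if margin < 1:
                    alpha[i] += eta * y[i]           # "add y * eta to the corresponding alpha"
        return alpha   # predict with sign(sum_j alpha_j * kernel(x_j, x_new))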

The difficulty with Pegasos for non-linear kernels is that the support set quickly becomes very large, and so evaluating the model becomes very slow. Note that since the alpha values are not constrained to be non-negative (unlike in the standard dual algorithms), the alpha values don't ever get clipped to zero--instead they just slowly converge to zero. It's still (I think) one of the fastest methods in terms of theoretical convergence guarantees, but perhaps not as fast as LaSVM or something similar in practice.

However, there's been a more general trend in machine learning to use linear models with lots of features instead of kernel models, partially because of these sorts of scalability issues.


Thanks for the reply. I was told on Twitter that this is similar to the kernel perceptron, which I don't know well either. There is a good introduction with a Python code snippet here:

http://www.mblondel.org/journal/2010/10/31/kernel-perceptron...

However, it seems that you need to compute the kernel expansion on the full set of samples (or maybe just the accumulated past samples?): this does not sound very online to me...


It's true you need to compute the kernel dot product between every example you see and every example in the support set (every example that has ever previously evaluated to signed margin < 1). Whether it's online depends on your definition of "online". It's definitely not online in the sense of using memory independent of the number of examples, since you have to keep around the support set. I think there are results showing the support set grows linearly with the size of the training set under reasonable assumptions. However, it is online in the sense that it operates on a stream of data, computing predictions and updates for each example one by one. It's also online in the sense that its analysis is based on online learning theory (e.g. mistake/regret bounds). A lot of learning theory papers use "online" in the latter two senses, which is confusing if you expect the former.


It's a very exciting time, and I'm incredibly excited to see what goes on here. I previously explored an online education start-up idea, and I'm really looking forward to seeing Ng and Koller change the world. I'm also very excited to see machine learning on the radar. For me, one of the biggest challenges is often making AI intuitive. As machine learning becomes more mainstream it will be on people's design radar, and that will make it easier to turn great algorithms into great products.


Partially, although in my experience over the past 4 years doing this stuff 1 hour cleaning the input data gets you thrice the output of 1 hour tuning the algos. Some algorithms are more sensitive than others, but in general, garbage in, garbage out.


I think in many cases you're correct. My point wasn't about the performance of the AI algorithms themselves though. In my experience most of the problems where I want to use AI the algorithm itself performs adequately. Getting the interaction with the algorithm sensible for a non-technical user is hard. If AI becomes prevalent enough that UI/UX people start thinking about it, I suspect it will be much easier to solve that problem, which to me is the bigger business problem with AI.


I think this is pretty accurate. Here is an example from my own thesis research: I'm using machine learning to tune an (underwater) communication link, i.e. decide what modulation / error coding algorithms/parameters will yield good data rates in a dynamic channel.

At first I tried using an off-the-shelf classifier to figure out which parameters would work well. That failed because by the time I had sampled a decent proportion of the possible parameter values, the channel would change (the number of possible combinations is on the order of a few million).

It turned out that the real problem is not learning the performance of the available parameters; rather, it lies in "learning how to learn": my ML system needs to adaptively search the space, responding to the history of previous explorations and their outcomes. This kind of exploration is effective only with an understanding of how the underlying modulation/coding algorithms work and interact with each other.
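Just to illustrate the flavor of that loop (a toy epsilon-greedy sketch with exponential forgetting, nothing like the real system): `settings` would be a list of (modulation, coding) combinations, and `measure_rate` is a hypothetical function that transmits with a setting and returns the observed data rate.

    import random

    def adaptive_search(settings, measure_rate, rounds=1000, epsilon=0.1, decay=0.9):
        estimates = {s: 0.0 for s in settings}   # running estimate of each setting's rate
        for _ in range(rounds):
            if random.random() < epsilon:
                s = random.choice(settings)                # explore a random setting
            else:
                s = max(estimates, key=estimates.get)      # exploit the current best guess
            rate = measure_rate(s)
            # Exponential forgetting: recent observations dominate as the channel drifts.
            estimates[s] = decay * estimates[s] + (1 - decay) * rate
        return max(estimates, key=estimates.get)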


Indeed. From our own experience: we use pretty much off-the-shelf maximum entropy parameter estimators for parse disambiguation and fluency ranking. In the past ~10 years, most of the gain has come from smart feature engineering using linguistic insights, analyzing common classes of classification errors, etc. Beyond L1 or L2 regularization, the use of (even) more sophisticated machine learning algorithms/techniques has not yet given much, if any, improvement for these tasks in our system.

What did help in understanding models is the application of newer feature selection techniques that give a ranked list of features, such as grafting.


Reading this reminded me of this recent post by Chris Dixon, which is also a good read: http://cdixon.org/2012/04/14/there-are-two-ways-to-make-larg...


My for-profit company (Brighter Planet) often gets product ideas from our data scientists; it's exactly what Dr. Haghighi is talking about.

For example: trying to model environmental impact of Bill Gates's 66,000 sq ft house during a hackathon -> discovery that we need fuzzy set analysis (https://github.com/seamusabshere/fuzzy_infer) -> new, marketable capabilities in our hotel modelling product (https://github.com/brighterplanet/lodging/blob/master/lib/lo...).


I have enjoyed the author's other posts on his Prismatic blog. It's one of the most interesting blogs to follow, with only a few posts so far. However, this article falls a bit short. It feels rushed out, which is understandable.

I think it would have been better if this were just the first part of a multi-article write-up on ML, with this one being an intro and follow-ups on specific approaches.


Probably try PCA (principal component analysis) first, to help select the most important features of the data, before going further in modeling it.
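A minimal scikit-learn sketch of that suggestion, assuming a numeric feature matrix X (keeping enough components to explain 95% of the variance before handing the reduced data to a model):

    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95))
    # X_reduced = reducer.fit_transform(X)   # X: your raw numeric feature matrix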


The article doesn't mention two important things (and instead focuses on being clever - the opposite of what machine learning stands for). First, the deep learning algorithms that automatically create features. Second, the importance of gathering lots of data, or generating it.

If you have to be really clever with feature engineering, then what's the point of even calling yourself a machine learning person?


I agree that deep learning is an interesting approach to learning higher-level features. However, it's still a long way from being a universal solution: for instance, deep learning won't help you solve the machine translation or multi-document text summarization problems automagically. You still need to find good (hence often task-dependent) representations for both your input data and the data structure you are trying to learn a predictive model for.


Deep learning is an interesting approach - although the features that DL algorithms decide are most important are not always intuitive or weighted properly in context. Partial-feature engineering is sometimes the only way to effectively deal with biases, especially in higher-dimensional space where the DL features can be very opaque.



