“When you have enough data, sometimes, you don’t have to be too clever” (strafenet.com)
45 points by smarterchild on Dec 17, 2011 | 13 comments


When data is easy to collect, someone will ask you to collect it and someone else will query the data and compile a report with percentages in it. Then someone else will worry about some of the percentages being less or more than some benchmark. Then your work life will become less happy.

Example 1: Some years ago, I had to sit through a meeting where a committee worried about a 2% drop in satisfaction scores on a student questionnaire. No one checked how many replies were involved (around 400, so it worked out to about six fewer people in the second year than the first, given that the ratings were something like 75%).

Example 2: I recently had to add comments in a record system about students whose attendance percentage had dropped below 90%. That was 8 weeks into the course...


Sometimes I get the feeling that when we had less data, we were forced to think harder and more daringly. I feel we lack new groundbreaking theoretical frameworks because of this.

I don't know if Newton's laws would jump out of the paper if you simply threw a ball at one million different vectors.


On the other hand, a lot of new research (including possibly ground-breaking theoretical results) is only possible now that we have access to large data.

We might initially process the large data using relatively simple techniques, but on the reduced data we can then run more sophisticated methods that actually work, because the underlying data comes from a huge number of samples.

As but one example, in computer vision, the concept of "attributes" -- automatically labeling objects using descriptive words instead of categorical ones, i.e., "this thing is like..." rather than "this thing is..." -- has opened the door to a number of exciting advances. One is the concept of "zero-shot learning": automatically recognizing an object that you've never seen an instance of before simply via a description. For example, one could recognize beavers as "small, four-legged furry rodents with big teeth and a flat tail", without having ever seen a beaver before. The training data for this classifier need not include beavers, but only images which match the individual attributes, not necessarily all in the same image -- small, four-legged, furry, rodent, big teeth, flat tail.

This kind of thing was not really possible before, because there just wasn't enough data to train reliable classifiers for each attribute in any kind of automated way.

Finally, as I alluded to at the beginning, these individual attribute classifiers are often relatively simple algorithms, such as Support Vector Machines (SVMs). Yet, the 2nd-stage algorithms that use the attribute values to do something useful, such as the zero-shot learning application described above, are often much more involved/advanced techniques.
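
To make the two-stage idea above concrete, here is a minimal Python sketch of attribute-based zero-shot recognition. It is only an illustration under invented assumptions: the "image" features are random placeholders, and the attribute list, class descriptions, and nearest-description matching rule are made up rather than taken from any particular paper.

    # Minimal sketch of attribute-based zero-shot recognition (illustrative only).
    # Stage 1: one simple per-attribute classifier; Stage 2: match predicted
    # attributes against a verbal description of a class never seen in training.
    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    ATTRIBUTES = ["small", "four_legged", "furry", "rodent", "big_teeth", "flat_tail"]

    # Placeholder training set: "image" features plus per-image attribute labels.
    # Note that no image of the target class ("beaver") appears here at all.
    X_train = rng.normal(size=(500, 32))
    y_attr = rng.integers(0, 2, size=(500, len(ATTRIBUTES)))   # 0/1 label per attribute

    # Stage 1: a relatively simple classifier (linear SVM) per attribute.
    attr_clfs = [LinearSVC().fit(X_train, y_attr[:, j]) for j in range(len(ATTRIBUTES))]

    # Class descriptions in attribute space; "beaver" is described, never trained on.
    class_descriptions = {
        "beaver":  np.array([1, 1, 1, 1, 1, 1]),
        "sparrow": np.array([1, 0, 0, 0, 0, 0]),
    }

    def zero_shot_predict(x):
        # Stage 2: predict the attributes of x, then pick the class whose
        # description agrees with the most predicted attributes.
        pred = np.array([clf.predict(x.reshape(1, -1))[0] for clf in attr_clfs])
        return max(class_descriptions, key=lambda c: (class_descriptions[c] == pred).sum())

    print(zero_shot_predict(rng.normal(size=32)))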


I recently visited the Galapagos Islands. There are 2 things that made it possible for Darwin to work out his theory after visiting here.

1. Remoteness of location - few outside influences
2. Relatively few species!

Even though it's on the equator, the islands aren't all jungle and animals. The sheer lack of different species made it possible to see every single one of them in a single visit, and allowed Darwin to theorize without thinking he had missed something.

Sometimes, simplicity helps with focus.


There's still quite a bit of work going on in various directions; "simple theory on big data" is just one of many research agendas in statistics/ML, and it's not really the majority one (though it gets quite a bit of press). It's what Google pushes in part because it's their competitive advantage: they have more data, and the ability to access/manipulate it in reasonable time, than many other places do, so it makes sense for them to see what they can get out of it.


On the bright side, with so many people online and with different perspectives, it becomes easier to expose flaws, mediocre interpretations, etc.


Well, using data at large scale is the new groundbreaking theoretical framework. And it's practical, too.


This is pretty much the same conclusion Ilya Grigorik (founder of postrank, which was recently bought by Google) came to: http://vimeo.com/22513786


Naive Bayes: the "good enough" classifier.


Looking at the video, you could interpret his statement two ways. Either the headline - “When you have enough data, sometimes, you don’t have to be too clever” - OR the sort-of-opposite - "AI has made so little progress that we don't have anything much better than naive Bayes".


I'd say both are somewhat true.

A lot of early "progress" in AI was found not to survive contact with the real world -- for example, most of computer vision. This was because collecting data was so expensive/difficult that only a few images could be captured for many experiments, and the methods people came up with often worked okay for those examples, but nothing else! So a lot of clever-seeming algorithms ended up being rather useless in the real world, and the progress was illusory.

I find that in computer vision (my area of research), a fundamental component of many disparate problems is that you are trying to interpolate or extrapolate data in a very complicated underlying space, where linear approximations are completely unusable and optimization is too unconstrained. The key is to come up with suitable regularizers that can use prior information to constrain the problem appropriately.

Getting more data thus helps in two ways:

1. It reduces the amount of interpolation you have to do, since you can get a denser sampling of the space.

2. It allows you to build up these priors using real data, making the interpolation much better (a rough sketch of this follows below).
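
As a rough sketch of point 2 (a toy setup I invented, not anyone's actual pipeline): a Tikhonov-style penalty that pulls a small-sample fit toward a prior estimated from a large earlier dataset. The cubic model, noise levels, and penalty weight are all arbitrary.

    # Toy illustration: with only a handful of samples, an unconstrained polynomial
    # fit is poorly determined, while a penalty pulling the solution toward a prior
    # (itself estimated from plenty of data) typically keeps it close to the truth.
    import numpy as np

    rng = np.random.default_rng(1)
    true_w = np.array([0.0, 1.0, -0.5, 0.05])          # "ground truth" cubic coefficients

    def design(x, degree=3):
        return np.vander(x, degree + 1, increasing=True)

    # Prior: coefficients fit once on a large, densely sampled dataset.
    x_big = rng.uniform(-3, 3, 2000)
    y_big = design(x_big) @ true_w + rng.normal(0, 0.3, x_big.size)
    w_prior, *_ = np.linalg.lstsq(design(x_big), y_big, rcond=None)

    # New problem: only five noisy samples.
    x_small = rng.uniform(-3, 3, 5)
    y_small = design(x_small) @ true_w + rng.normal(0, 0.3, x_small.size)
    X = design(x_small)

    # Unregularized least squares vs. solution of
    #   min ||Xw - y||^2 + lam * ||w - w_prior||^2
    w_plain, *_ = np.linalg.lstsq(X, y_small, rcond=None)
    lam = 1.0
    w_reg = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]),
                            X.T @ y_small + lam * w_prior)

    print("coefficient error, unregularized:    ", np.linalg.norm(w_plain - true_w))
    print("coefficient error, prior-regularized:", np.linalg.norm(w_reg - true_w))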


I find the latter to be a weird claim; there's very strong theoretical and empirical evidence that Naive Bayes is significantly worse than other algorithms that can model cross-feature correlations (including really dumb linear regressions).

Empirically, this paper (http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icm...) makes a reasonably compelling case that Naive Bayes is really not very good compared to anything that actually models cross-feature correlations. Theoretically, it's clear that Naive Bayes will fail in unboundedly bad ways given enough strongly correlated features (If I just duplicate the feature N times, I effectively multiply its coefficient by N without actually adding any new information).

Note: I believe there is a technical weakness in the paper due to how they quantized continuous variables for use in Naive Bayes, but the overall performance trends reported confirm my experience with modeling projects in the wild.

Edit: I realize that one might make the claim just in the context of huge data sets, but again you have to get lucky not to have strong correlation effects that other models would handle better.

Edit 2: Oh, I'm an idiot. You specifically said AI. I'll leave the comment as it was, because I often hear the "with enough data Naive Bayes is as good as anything else" story and hope to influence anyone who might be impressionable :-)
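
For anyone who would rather see the duplicated-feature effect than take my word for it, here is a small synthetic sklearn sketch (dataset and agreement rates invented): every extra copy of a weak feature adds its log-likelihood ratio to the Naive Bayes score again, so the prediction eventually flips, while a jointly-fit logistic regression is much less affected.

    # Duplicating one feature N times in Naive Bayes multiplies its effective
    # weight by N; logistic regression fits the weights jointly instead.
    import numpy as np
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 5000
    y = rng.integers(0, 2, n)
    strong = y ^ (rng.random(n) < 0.05).astype(int)   # agrees with y ~95% of the time
    weak   = y ^ (rng.random(n) < 0.40).astype(int)   # agrees with y ~60% of the time

    def predictions_with_copies(k):
        """Train with the weak feature duplicated k times, then classify a test
        point where the strong feature says 1 and every weak copy says 0."""
        X = np.column_stack([strong] + [weak] * k)
        test = np.array([[1] + [0] * k])
        nb = BernoulliNB().fit(X, y)
        lr = LogisticRegression().fit(X, y)
        return nb.predict(test)[0], lr.predict(test)[0]

    for k in (1, 5, 25):
        nb_pred, lr_pred = predictions_with_copies(k)
        print(f"{k:2d} copies of weak feature -> NB predicts {nb_pred}, LR predicts {lr_pred}")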


You could, but this is Norvig so everyone who has read his previous stuff on big data knows immediately the former interpretation is meant.



