
The datasets serve as benchmarks. You get an idea for a new model that solves a problem current models have. Most such ideas don't pan out, so you need empirical evidence that yours actually works. To show that your model does better than previous models, you need some task that your model and the previous ones can share for both training and evaluation. It's more complicated than that, but that's the gist.

It would be so wasteful to have to retrain a dozen models, each requiring a month of GPU time, just to serve as baselines for your new model...
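
Roughly, the pattern looks like this (a purely hypothetical sketch; the file name, the stub model, and the baseline scores below are all made up):

    # Hypothetical sketch: score a new model on the frozen test split that the
    # published baselines used, so their numbers can be reused without retraining.
    import json

    class MyNewModel:
        """Stand-in for the model being proposed; predicts a constant class."""
        def predict(self, text):
            return 1

    def accuracy(model, examples):
        correct = sum(model.predict(ex["input"]) == ex["label"] for ex in examples)
        return correct / len(examples)

    with open("benchmark_v1_test.json") as f:   # frozen split shared by every paper
        test_set = json.load(f)

    published_baselines = {"prior_model_a": 0.874, "prior_model_b": 0.881}  # quoted, not retrained

    new_score = accuracy(MyNewModel(), test_set)
    for name, score in published_baselines.items():
        print(f"{name}: {score:.3f}   new model: {new_score:.3f}")

Because the test split is frozen, the published baseline numbers stay valid across papers and nobody has to burn GPU time reproducing them.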



It also potentially gives every paper N replication problems to solve, in addition to the GPU time. I would have to figure out HOW to retrain all of these models on the current form of the dataset... Which is fine for an occasional explicit replication study, but terrible if everyone has to do it.

I think it's probably better to have a (say) yearly release of the dataset, with results of some benchmark models released alongside the new version.

This is similar to how Common Voice handles the problem: it's a crowd-sourced, constantly growing dataset, which is great if you want to train on as much data as possible for production models. You can grab the whole current version at any time, but they also publish releases with a static file set and train/test split, which are better suited to research.
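
To make that pinning explicit in a paper, a minimal sketch (the release tag and archive name are invented for illustration) is to record the release identifier and a content hash of the archive actually used alongside the results:

    # Illustrative sketch: pin an experiment to one dataset release by recording
    # the release tag and a content hash of the archive that was actually used.
    import hashlib
    import json
    from pathlib import Path

    def sha256_of(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    archive = Path("cv-corpus-7.0-en.tar.gz")   # hypothetical release archive
    record = {
        "dataset_release": "7.0",               # hypothetical version tag
        "archive_sha256": sha256_of(archive),
        "split": "official train/test split shipped with the release",
    }
    Path("experiment_meta.json").write_text(json.dumps(record, indent=2))

Anyone reporting the same release tag and hash is then directly comparable, without retraining anything.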


That's not wasteful. That's correction.

Is it wasteful to throw away a batch of food when 20% of it has been found to contain the wrong substance, one that ends up causing disease?

Isn't it even more wasteful to keep using unedited and unverified datasets just because all the previous models were trained on them, and thus we can no longer advance the state of the research? It's a case of garbage in, garbage out.


The thing is, the value as a baseline doesn't actually change that much for being 20% garbage. A bit counter-intuitive, but basically accepted as true in several fields.

The comparisons are all about relative accuracy, not absolute accuracy. And the comparison is fair: the new technique receives the same part-garbage input that the old techniques were trained on. For the most part, the better technique will still tend to do better, unless there's something specific about it that makes it more sensitive to labeling errors.

And frankly, a percentage of junk has some advantages. Real-world data is a pile of ass, so it's useful for academic models to require robustness.
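
To make the relative-vs-absolute point concrete, here's a small, entirely synthetic simulation (it has nothing to do with any real benchmark): flip 20% of the test labels uniformly at random, and both models' measured accuracy drops, but the genuinely better model still comes out ahead.

    # Synthetic sketch: uniform label noise hurts both models' measured accuracy,
    # but usually preserves which one looks better.
    import random

    random.seed(0)
    N = 100_000
    true_labels = [random.randint(0, 1) for _ in range(N)]

    def fake_model(error_rate):
        """A fake model that is wrong on a random `error_rate` fraction of examples."""
        return [y ^ (random.random() < error_rate) for y in true_labels]

    good_preds = fake_model(0.10)   # 90% true accuracy
    weak_preds = fake_model(0.15)   # 85% true accuracy

    # Corrupt 20% of the *test labels* uniformly at random.
    noisy_labels = [y ^ (random.random() < 0.20) for y in true_labels]

    def acc(preds, labels):
        return sum(p == y for p, y in zip(preds, labels)) / len(labels)

    print(acc(good_preds, noisy_labels), acc(weak_preds, noisy_labels))
    # Both measured scores drop (to roughly 0.74 and 0.71), but the ordering holds.

The caveat above is exactly where this breaks down: noise that is correlated with what one model gets wrong, e.g. a systematically mislabelled subgroup, can flip the ranking.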


I thought SOTA results still only differed by a few percent?

It seems worrisome that those few percent might just be coin flips landing right instead of wrong on a mislabelled subgroup of the data...


>By one estimate, the training time for AlphaGo cost $35 million [0]

How about XLNet, which cost something like $30k-60k to train [1]? GPT-2 is estimated to have cost around the same [2], while thankfully BERT only costs about $7k [3], unless of course you're going to do any new hyperparameter tuning on their models, which you of course will do on your own. Who cares about apples-to-apples comparisons?

We're not talking about spending an extra couple hours and a little money on updated replication. We're talking about an immediate overhead of tens to hundreds of thousands of dollars per new paper.

Tasks are updated over time already to take issues into account, but not continuously as far as I know.

[0] https://www.wired.com/story/deepminds-losses-future-artifici...

[1] https://twitter.com/jekbradbury/status/1143397614093651969

[2] https://news.ycombinator.com/item?id=19402666

[3] https://syncedreview.com/2019/06/27/the-staggering-cost-of-t...


BERT is pretrained on unlabelled data. It's not the same kind of model the article talks about.


Yeah, it is by no means wasteful for AlphaGo to throw away all its training data and then retrain itself!

That kind of ruthless experimentation is how AlphaGo was able to exceed even itself. The willingness to say - all these human games we've fed the computer? All these terabytes of data? It's all meaningless! We're going to throw it all away! We will have AlphaGo determine what is good by playing games against itself!

And I bet you that for the next iteration of AlphaGo, the creators of this system will once again delete their own data and retrain when they have a better approach.

If you don't "waste" your existing datasets (once you reallze the flaws in your data sets), you are being held back by the sunk cost principle. You only have yourself to blame when someone does train for the exact same purposes, but with cleaner data.

The person who has the cleanest source of training data will win in deep learning.

You're sabotaging yourself, in my opinion. $30k is nothing when the alternative is sabotaging the training with faulty data.


I'm actually glad it costs so much to train these models. Great incentive to find more efficient algorithms. That's how biological brains evolved.


As an investor, $35M to train just about the pinnacle of AI seems cheap, oh so cheap. I can't even buy one freaking continental jet for that price, and there are thousands of those babies flying (not as we speak, but generally).

I don't think you are fully cognizant yet of the formidable scale of AI in the grander scheme of things; as an industry, it is nowadays comparable to transistors circa 1972 in terms of maturity. Long, long way to go before we sit on "reference" anything. Whether architectures, protocols, models, or test standards, it's a Wild West as we speak.

You make excellent points in principle, which are important to keep in mind in guiding us all along the way, but now is not the time to set things in stone. More like the opposite.

The fact of the matter is that someone will eventually grab both the old and the new benchmarks, prove superiority on both, and at that point the new one becomes the one to beat, since it is presumably error-free this time.


The dataset is a controlled variable in an experiment, so it has to be held constant. If you update both your model and the dataset for every trial (e.g. new hyperparameters or a new architecture) and find it performs better, you won't know whether the model is really better or the improvement just came from the new dataset.
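
As a hedged sketch of what holding the dataset constant means in practice (all names here are illustrative), the split is derived from a fixed seed and reused across trials, so two runs differ only in the model, not in the data they saw:

    # Illustrative sketch: fix the dataset and split so that only the model varies
    # between trials; otherwise a better score could just mean an easier split.
    import random

    def fixed_split(examples, seed=42, test_fraction=0.2):
        """Deterministic train/test split: same seed and data give the same split."""
        rng = random.Random(seed)
        shuffled = examples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - test_fraction))
        return shuffled[:cut], shuffled[cut:]

    examples = list(range(1000))            # stand-in for a real dataset
    train_a, test_a = fixed_split(examples)
    train_b, test_b = fixed_split(examples)
    assert test_a == test_b                 # every trial evaluates on identical data
    # Hypothetical model_v1 and model_v2 would both be trained on train_a and
    # compared on test_a; any score difference is then attributable to the model.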


This is accurate, but it's also worth noting that the AI community moves to new benchmarks all the time (e.g. SQuAD 2.0 came out about a year after SQuAD). So in effect editing does happen, just in a batched way instead of a continuous, wiki-type way. This blog post deals with "VOC 2012, COCO 2017, Self Driving Car Udacity", which seem like pretty old datasets no longer really in use. There were actually news stories about the self-driving car dataset, so the knowledge that it has issues is not even new. Not to say this isn't useful, but it would be nice to note...


Also worth noting that many of those moves were because the original was found to be too easy, like in SQuAD's case, rather than being too noisy.


Why not just version them?


But you can have version numbers like with code and models.
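
A minimal sketch of what that versioning could look like (the file names are made up; real tools such as DVC do this more thoroughly) is a manifest that maps a version tag to content hashes, so "trained on v1.0" and "trained on v1.1 (relabeled)" are unambiguous:

    # Minimal sketch of dataset versioning: a manifest mapping a version tag to
    # per-file hashes, so papers can cite exactly which revision they used.
    import hashlib
    import json
    from pathlib import Path

    def file_hash(path: Path) -> str:
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def write_manifest(version: str, data_dir: Path, out: Path) -> None:
        manifest = {
            "version": version,
            "files": {p.name: file_hash(p)
                      for p in sorted(data_dir.glob("*")) if p.is_file()},
        }
        out.write_text(json.dumps(manifest, indent=2))

    # Hypothetical usage: v1.0 is the original release, v1.1 fixes the bad labels.
    write_manifest("1.1", Path("dataset/"), Path("manifest-v1.1.json"))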


What you're saying is that it's worth it to lie because it's too expensive to give a truthful answer. That is something that your customers likely would not agree with.



