jfrankle's comments

whyyy


Honestly, just a matter of having the time to clean everything up and get it out. The ancillary code, model cards, etc. take a surprising amount of time.


It's an indictment of the A100 node that died on us yesterday, leaving us with 248 GPUs in the particular cluster where we were running the experiments :(

It turns out that, in these kinds of large-scale experiments, hardware failures are a constant fact of life, and we have tools to manage them and allow runs to continue anyway.

Unfortunately, continuing on fewer nodes would mess up our throughput calculations for the clean baselines here, so we're waiting for our cloud provider to kindly replace the bad A100. Expect those numbers in the next day or so.


Getting reliable GPUs is a difficult problem; I empathize. I've lost a decent amount of time and money to a single failing GPU on an AWS cluster.


We've come to accept that it's an impossible problem at this point. Instead, we're getting good at automatically detecting hardware failures and rapidly restarting runs on fewer nodes. We're also exploring batch sizes that are (where possible) divisible by both N nodes and N-1 nodes. Fault-tolerant system design is unfortunately an evergreen topic in CS.
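
Concretely, the arithmetic looks something like this (an illustrative sketch in Python, not our actual tooling; since N and N-1 share no common factors, divisibility by both means divisibility by N * (N-1)):

    from math import lcm  # Python 3.9+

    def fault_tolerant_batch_size(target: int, n_nodes: int) -> int:
        # Round the target global batch size up to the nearest multiple of
        # lcm(N, N-1) so a run can restart on N-1 nodes without changing
        # the per-step batch size. Since gcd(N, N-1) == 1,
        # lcm(N, N-1) == N * (N - 1).
        step = lcm(n_nodes, n_nodes - 1)
        return ((target + step - 1) // step) * step

    # e.g. a 32-node cluster: batch sizes divisible by both 32 and 31
    print(fault_tolerant_batch_size(1024, 32))  # -> 1984 == 32 * 31 * 2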


This is an illuminating (and notably rigorous) read for anyone interested in neural network sparsity and compression. But - equally importantly - it's a valuable read for anyone interested in the replicability of neural network research in general. The authors make clear the urgent need to evaluate research (and reevaluate received wisdom) on networks of the scale and complexity used in practice. I hope this paper will spark some important conversations in the community about our standards for assessing new ideas (mine included). As this paper makes exceedingly clear, plenty of techniques and behaviors observed on MNIST and CIFAR-10 manifest differently (if at all) in industrial-scale settings.

My biggest question coming out of this work was as follows: which small-scale (or - at the very least - inexpensive) benchmarks share enough properties with these large-scale networks that we should expect results to scale with reasonable fidelity? ResNet-50 is still far too slow and expensive to use as a day-to-day research network in academia, let alone a Transformer. Personally, I've found ResNet-18 on CIFAR-10 to pretty reliably predict the behavior of ResNet-50 on ImageNet, but that's anecdotal. For the academics who can't drop hundreds of thousands of dollars (or more) on each paper but still want to contribute to research progress, we should carefully assess (or design) benchmarks with this property in mind.

(With respect to the lottery ticket hypothesis, we have a complementary ICML submission about its behavior on large-scale networks coming shortly!)


I think the goal should be to use the smallest dense network possible as the baseline. For MNIST, this might be a LeNet-style convnet with widths [3, 9, 50] instead of the standard [20, 50, 500] network (which is way overkill).
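
For concreteness, a minimal PyTorch sketch of what I mean (reading the bracketed numbers as (conv1 channels, conv2 channels, fc units) in the Caffe-style LeNet; that interpretation is mine):

    import torch.nn as nn

    def lenet(widths=(20, 50, 500)):
        # Caffe-style LeNet for 28x28 MNIST, parameterized by layer widths.
        c1, c2, fc = widths
        return nn.Sequential(
            nn.Conv2d(1, c1, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # 28 -> 24 -> 12
            nn.Conv2d(c1, c2, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # 12 -> 8 -> 4
            nn.Flatten(),
            nn.Linear(c2 * 4 * 4, fc), nn.ReLU(),
            nn.Linear(fc, 10),
        )

    small, standard = lenet((3, 9, 50)), lenet((20, 50, 500))
    # ~8.5K parameters vs. ~431K for the standard network.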

I haven't explored this on CIFAR, but my guess is that using a more efficient architecture like MobileNetV2 would yield results more likely to transfer.
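
Something like this is what I have in mind (a hypothetical adaptation on my part, not a reference recipe; torchvision's stock MobileNetV2 expects ImageNet-sized inputs, so the stem's stride usually gets reduced for 32x32 images):

    import torch.nn as nn
    from torchvision.models import mobilenet_v2

    model = mobilenet_v2(num_classes=10)
    # Swap the stride-2 stem conv for stride 1 so 32x32 CIFAR inputs
    # aren't downsampled too aggressively in the first layer.
    model.features[0][0] = nn.Conv2d(3, 32, kernel_size=3, stride=1,
                                     padding=1, bias=False)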

The general theme is that you should be using the smallest dense model you possibly can as a baseline.


(Author of paper here) This is approximately my growing suspicion.

