Show HN: Part 4: Train CIFAR10 to 94% in under 8 seconds on a single A100 (github.com/tysam-code)
2 points by tysam_and on Feb 13, 2023
Hello everyone,

It's been two weeks since we moved under the 10 second mark. With some more (very hard) work in that time, we've passed our internal 8 second benchmark for another release, so we're releasing this next update!

This update changes the neural network architecture to our own new, custom 8-layer ResNet (dubbed SpeedyResNet), which is extremely simple and fast. We also do some hyperparameter tuning, round the hyperparameters to cleaner values than before, and change the learning process a bit by changing how we use our EMA. All of this adds only 2 (or 3, depending on what you're counting) lines of new code! The vast majority of the rest of the work is editing, changing, or simply outright deleting other code, which results in a codebase that is a bit simpler (at least in layout) and faster than before. We also eliminate a hyperparameter that no longer seems useful. One downside of these changes is that we overfit slightly more on longer runs. That can be mitigated well enough with cutout, and we hope to fix it in future releases; it's not a terrible problem to have when trying to set speed records.
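
To make that concrete, here is a minimal, illustrative sketch of the kind of building blocks a simple residual network plus an EMA of the weights involves. This is not the actual SpeedyResNet definition: the layer widths, activation choice, and EMA decay below are assumptions for illustration only, and the real code lives in the repo.

    # Illustrative sketch only -- not the actual SpeedyResNet. Widths, activation,
    # and decay are assumed values; see the repository for the real implementation.
    import torch
    import torch.nn as nn

    class ConvBlock(nn.Module):
        """Conv -> BatchNorm -> activation: the basic unit of a simple ResNet."""
        def __init__(self, c_in, c_out):
            super().__init__()
            self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False)
            self.norm = nn.BatchNorm2d(c_out)
            self.act = nn.GELU()

        def forward(self, x):
            return self.act(self.norm(self.conv(x)))

    class ResidualBlock(nn.Module):
        """Two conv blocks wrapped with an identity skip connection."""
        def __init__(self, channels):
            super().__init__()
            self.block1 = ConvBlock(channels, channels)
            self.block2 = ConvBlock(channels, channels)

        def forward(self, x):
            return x + self.block2(self.block1(x))

    def update_ema(ema_model, model, decay=0.99):
        """Blend the live weights into a slowly-moving averaged copy after each step."""
        with torch.no_grad():
            for ema_p, p in zip(ema_model.parameters(), model.parameters()):
                ema_p.mul_(decay).add_(p, alpha=1.0 - decay)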

We also test our code on CIFAR100 without any modifications (other than to the dataloaders, to load the correct amount of data and the correct number of classes) and show that performance across two different network sizes is comparable between the datasets. In rough initial explorations, both small networks roughly matched the performance of SOTA networks from around the same year, and doubling the base depth of the network improved both by about a year's worth of progress, matching the respective SOTAs from that later period. This suggests that the code (hopefully) generalizes beyond just this dataset, though we have not experimented with different image sizes yet (it's rather expensive, and the information might go stale very quickly!).
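
For reference, the CIFAR100 swap really is just the dataloader side of things. A hedged sketch of what that looks like with torchvision (the variable names here are illustrative, not the repository's actual identifiers):

    # Assumed sketch of the dataloader change for CIFAR100; names are illustrative.
    import torchvision

    num_classes = 100  # was 10 for CIFAR10
    train_set = torchvision.datasets.CIFAR100(root='./data', train=True,  download=True)
    eval_set  = torchvision.datasets.CIFAR100(root='./data', train=False, download=True)
    # The network's final classifier layer then needs num_classes outputs instead of 10.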

There's a lot more in here, but as in previous posts, the mantra of sorts is 'doing the basics and doing them very well'. This goes a whole lot further than it might seem: chasing 'the new shiny' when developing neural networks is oftentimes more of a toy and a distraction than sticking with the basics and doing them well, which is, understandably, a very difficult thing. That said, as we run out of runway for the 'easier' changes, we will likely need to get more and more creative.

But until then, the goal is to stay as simple as possible! If you'd like more info, please do read the release notes; they are longer but very informative. Future releases could focus more on speed improvements or on other things.

Additionally, this should still be an excellent researcher's workbench for prototyping and experimenting with ideas. Many ideas I've been able to implement in 5 minutes or less, and most of them are actually running in the code by that point. For some, like certain architecture changes, I can get an initial go/no-go answer within 1-2 minutes of having the idea: I quickly tweak the code, let it go through a few runs, and then either kill it because it definitely doesn't work or let it run more seeds to see whether the answer is just noise. This is indispensable and part of why I built this tool. It's also partially responsible for the rapid progress in developing the tool itself -- I'm able to apply the very-rapidly-gained insights from this tool back to it.
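
As a rough picture of what that go/no-go loop looks like (purely illustrative -- it assumes a hypothetical main() entry point that trains once and returns eval accuracy, which may not match the repository's actual interface):

    # Hypothetical sketch of a quick go/no-go check; main() is an assumed interface.
    import statistics

    def quick_check(main, n_runs=3, baseline=0.94):
        """Run a few short training runs and compare mean accuracy to a known baseline."""
        accs = [main() for _ in range(n_runs)]
        mean, spread = statistics.mean(accs), statistics.pstdev(accs)
        print(f"mean={mean:.4f} +/- {spread:.4f} over {n_runs} runs (baseline {baseline})")
        # Clearly below baseline even given the spread -> no go; close -> run more seeds.
        return mean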

I'll be hanging around here for questions/comments/etc. I can't answer all of them, but I'll do the best that I can! :D



