The compute scheduling part of the paper is also very good, particularly the way they balanced the load to keep compute and communication in check.
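To make that concrete, here's the general shape of compute/communication overlap as a toy PyTorch sketch (my own illustration, not the paper's actual scheduler): start an asynchronous collective, do independent work while it's in flight, and only wait on the handle when the result is needed.

```python
# Toy sketch of overlapping communication with compute (not DeepSeek's scheduler).
import torch
import torch.distributed as dist

# Single-process group just so the example runs; real setups use many ranks.
dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                        rank=0, world_size=1)

grads = torch.randn(1024)
handle = dist.all_reduce(grads, async_op=True)   # communication starts here

# Independent local compute proceeds while the all-reduce is in flight.
local_work = torch.randn(1024, 1024) @ torch.randn(1024, 1024)

handle.wait()   # block only at the point the reduced gradients are actually needed
dist.destroy_process_group()
```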
There is also a lot of thought put into all the tiny optimizations that reduce memory usage, and into using FP8 effectively without significant loss of precision or dynamic range.
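For anyone curious what "FP8 without losing dynamic range" can look like, here's a minimal block-scaling sketch (my own illustration under the assumption of per-block scales, not DeepSeek's kernels; `quantize_fp8_blockwise` is a made-up helper): each block of values gets its own scale so its maximum maps near the FP8 E4M3 limit, which keeps blocks of very different magnitudes representable.

```python
# Minimal block-wise FP8 scaling sketch. Requires PyTorch >= 2.1 for float8 dtypes.
import torch

FP8_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_fp8_blockwise(x: torch.Tensor, block: int = 128):
    """Quantize a 1-D tensor to FP8 with one scale per block of `block` values."""
    x = x.reshape(-1, block)                        # (num_blocks, block)
    scale = x.abs().amax(dim=1, keepdim=True) / FP8_MAX
    scale = scale.clamp(min=1e-12)                  # avoid divide-by-zero on all-zero blocks
    q = (x / scale).to(torch.float8_e4m3fn)         # each block now fits FP8's range
    return q, scale

def dequantize_fp8_blockwise(q: torch.Tensor, scale: torch.Tensor):
    return (q.to(torch.float32) * scale).reshape(-1)

x = torch.randn(1024) * torch.logspace(-3, 3, 1024)  # values spanning a wide dynamic range
q, s = quantize_fp8_blockwise(x)
max_err = (dequantize_fp8_blockwise(q, s) - x).abs().max()
```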
None of the techniques by themselves are really mind-blowing, but the whole of it is very well done.
When everyone kind of ignores performance because compute is cheap and speed will double anyway in 18 months (note: that hasn't been true for 15 years), the willingness to optimize is almost a secret weapon. The first 50% or so is usually not even difficult, because there is so much low-hanging fruit, and in most environments there's a lot of helpful tooling to measure exactly which parts are slow.
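As a toy example of that tooling, even stock Python ships a profiler that points straight at the hot spots (the function names below are made up):

```python
# "Measure first": cProfile already tells you which functions eat the time.
import cProfile
import pstats

def slow_preprocessing(data):
    return [x ** 2 for x in data for _ in range(100)]   # deliberately wasteful

def fast_model_step(features):
    return sum(features)

def pipeline():
    data = list(range(10_000))
    features = slow_preprocessing(data)
    return fast_model_step(features)

cProfile.run("pipeline()", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(5)
```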
Compute has been more than doubling because people have been spending silly money on it. How long ago would a proposal for a $10M ML cluster have been considered surreal by any funding agency? Certainly less than 10 years ago. Now people are talking about spending billions and billions.
When people are talking about $100M-$1B frontier model training runs, then obviously efficiency matters!
Sure, training costs will go down over time, but if you are only using 10% of the compute of your competition (TFA: DeepSeek vs LLaMa), then you could be saving hundreds of millions per training run!
I was more stating the perception that compute is cheap than the fact that compute is cheap - often enough it isn't! But carelessness about performance happens, well, by default really.
At my org this is a crazy problem. Before I arrived, people would throw all kinds of compute at problems. They still do. When you've got AWS over there ready to gobble up whatever tasks you've got, and the org is willing to pay, things get really sloppy.
It's also a science-based organization like OpenAI. Very intelligent people, but they aren't programmers first.
I think the AI megacorps' plan was always SaaS. Their focus was never on self-hosting, so optimization was useless: their customers would pay for unoptimized services whether they wanted to or not.
Making AI practical for self-hosting was the real disruption of DeepSeek.
The secret is basically to use RL to train a model that will generate synthetic data, and then use that synthetic dataset to fine-tune a pretrained model. The secret is synthetic data imo: https://medium.com/thoughts-on-machine-learning/the-laymans-...
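Roughly, the two-stage recipe being described looks something like this (all names here are hypothetical placeholders, not actual DeepSeek code): an RL-tuned "teacher" generates synthetic traces, and a pretrained "student" is fine-tuned on them.

```python
# Hypothetical sketch of the RL-teacher -> synthetic data -> SFT pipeline.
from typing import Callable, List

def generate_synthetic_dataset(teacher: Callable[[str], str],
                               prompts: List[str]) -> List[dict]:
    """Stage 1: sample completions from the RL-trained teacher model."""
    return [{"prompt": p, "completion": teacher(p)} for p in prompts]

def fine_tune(student_weights: dict, dataset: List[dict]) -> dict:
    """Stage 2: ordinary supervised fine-tuning on the synthetic pairs.
    (Stubbed out here; in practice this is a standard SFT training loop.)"""
    for example in dataset:
        pass  # compute loss on example["prompt"] -> example["completion"], update weights
    return student_weights

# Toy stand-ins so the sketch runs end to end.
teacher = lambda prompt: f"<reasoning about: {prompt}>"
prompts = ["What is 7 * 8?", "Sort [3, 1, 2]"]
synthetic = generate_synthetic_dataset(teacher, prompts)
student = fine_tune({"layer.0.weight": None}, synthetic)
```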
The DeepSeekV3 paper is really a good read: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSee...