
none of these techniques except MLA are new


They're not new in the same way Attention wasn't new when the transformer paper was written.

No one (publicly) had really pushed any of these techniques far, especially not for such a big run.


no one publicly pushes any techniques very far except for meta, and it's true they continue to train dense models for whatever reason.

the transformer was an entirely new architecture, a very different step change from this

e: and alibaba


They likely continue to train dense models because they are far easier to fine-tune, and this is a huge use case for the Llama models.


It probably also has to do with their internal infra. If it were just about dense models being easier for the OSS community to use & build on, they should probably be training MoEs and then distilling to dense.


There's new stuff in the lower layers. Some of the math is, well, interesting: a novel method of scaling mantissas and exponents. Yes, some of the operations have to use higher precision. Yes, some values like optimizer states, gradients and weights still need higher precision. But what they can do in 8 bits, they do in 8 bits. Of course, like everyone, they're reduced to begging NVIDIA to quantize on the global-to-shared memory transfer in order to realize the true potential of what they're trying to do. But hey, that's where we all are, and most papers I read don't have nearly as many interesting and novel techniques in them.
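
To make the fp8 part concrete, here's a rough sketch of block-wise scaling in PyTorch. The block size (128) and the e4m3 format are my assumptions for illustration, not a claim about DeepSeek's exact recipe; the point is just that each block gets its own scale so values fit fp8's narrow range, while accumulation and the master copies stay in higher precision.

    # Rough sketch of block-wise fp8 quantization (hypothetical 128-wide blocks,
    # e4m3 format). Each block of a row gets its own scale so the largest value
    # maps to fp8's max; dequantization multiplies the scale back in fp32.
    import torch

    FP8_MAX = 448.0  # largest normal value in torch.float8_e4m3fn

    def quantize_blockwise(x: torch.Tensor, block: int = 128):
        """Quantize a 2-D tensor to fp8 with one scale per block of each row."""
        rows, cols = x.shape  # cols must be divisible by `block` in this sketch
        x = x.reshape(rows, cols // block, block)
        scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
        q = (x / scale).to(torch.float8_e4m3fn)
        return q.reshape(rows, cols), scale.squeeze(-1)

    def dequantize_blockwise(q: torch.Tensor, scale: torch.Tensor, block: int = 128):
        """Undo the block-wise scaling, returning fp32."""
        rows, cols = q.shape
        x = q.to(torch.float32).reshape(rows, cols // block, block)
        return (x * scale.unsqueeze(-1)).reshape(rows, cols)

    w = torch.randn(256, 256)
    qw, sw = quantize_blockwise(w)
    print("max abs error:", (w - dequantize_blockwise(qw, sw)).abs().max().item())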

I think recomputing MLA and RMSNorm during backprop is something few would have done.
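
For context, that's a pure memory-for-compute trade: the outputs of cheap ops aren't cached for the backward pass, they're just rerun. A minimal sketch using the generic torch.utils.checkpoint mechanism (not DeepSeek's custom kernels):

    # Minimal sketch of activation recomputation for a cheap op like RMSNorm:
    # the checkpointed call stores only its inputs and reruns the forward during
    # backprop instead of caching intermediate activations.
    import torch
    from torch.utils.checkpoint import checkpoint

    class RMSNorm(torch.nn.Module):
        def __init__(self, dim: int, eps: float = 1e-6):
            super().__init__()
            self.weight = torch.nn.Parameter(torch.ones(dim))
            self.eps = eps

        def forward(self, x):
            rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
            return x * rms * self.weight

    norm = RMSNorm(1024)
    x = torch.randn(8, 1024, requires_grad=True)

    y_cached = norm(x)                                        # keeps activations around
    y_recomputed = checkpoint(norm, x, use_reentrant=False)   # reruns norm in backward
    y_recomputed.sum().backward()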

Dispensing with tensor parallelism by kind of overlapping forward and backprop. That would not have been intuitive to me. (I do, however, make room for the possibility that I'm just not terribly good at this anymore.)
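
For a sense of what interleaving forward and backward work looks like at the scheduling level, here's a toy 1F1B-style schedule in plain Python. This is the textbook pipeline baseline, not DualPipe itself (which additionally overlaps computation with communication), but it shows the flavor: a stage alternates forwards and backwards across micro-batches instead of finishing one full pass before starting the other.

    # Toy 1F1B pipeline schedule (illustration only, not DeepSeek's DualPipe):
    # each stage does a few warm-up forwards, then alternates one forward with
    # one backward per micro-batch, interleaving the two passes.
    def one_f_one_b(stage: int, num_stages: int, num_microbatches: int):
        """Return the ordered list of ops (F = forward, B = backward) a stage runs."""
        warmup = num_stages - stage - 1
        ops, fwd, bwd = [], 0, 0
        for _ in range(min(warmup, num_microbatches)):
            ops.append(f"F{fwd}"); fwd += 1
        while bwd < num_microbatches:
            if fwd < num_microbatches:
                ops.append(f"F{fwd}"); fwd += 1
            ops.append(f"B{bwd}"); bwd += 1
        return ops

    for s in range(4):
        print(f"stage {s}:", " ".join(one_f_one_b(s, 4, 8)))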

I don't know; I just think there are a lot of new takes in there.


i think some are reading my comment as critical of deepseek, but i'm more trying to say it is an infrastructural/engineering feat more so than an architectural innovation. this article doesn't even mention fp8. these have been by far the most interesting technical reports i've read in a while.


Flash attention was also built from techniques that were already common in other areas of optimized software, yet the big players weren't doing those optimizations when it came out, and it significantly improved everything.
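
For reference, the core of those borrowed techniques is tiling plus an online softmax, so the full attention matrix is never materialized. A naive single-query sketch in plain PyTorch (for clarity, not the fused kernel):

    # Naive sketch of flash attention's core trick for one query vector:
    # process keys/values in tiles while keeping a running max and running
    # softmax denominator, so the full score matrix never exists in memory.
    import torch

    def flash_attention_one_query(q, K, V, tile: int = 64):
        d = q.shape[-1]
        m = torch.tensor(float("-inf"))   # running max of the logits
        l = torch.tensor(0.0)             # running softmax denominator
        acc = torch.zeros_like(V[0])      # running weighted sum of values
        for start in range(0, K.shape[0], tile):
            k_tile, v_tile = K[start:start + tile], V[start:start + tile]
            scores = (k_tile @ q) / d**0.5
            m_new = torch.maximum(m, scores.max())
            correction = torch.exp(m - m_new)  # rescale old state to the new max
            p = torch.exp(scores - m_new)
            l = l * correction + p.sum()
            acc = acc * correction + p @ v_tile
            m = m_new
        return acc / l

    q = torch.randn(64)
    K, V = torch.randn(1024, 64), torch.randn(1024, 64)
    out = flash_attention_one_query(q, K, V)
    ref = torch.softmax((K @ q) / 64**0.5, dim=0) @ V
    print("max diff vs. reference:", (out - ref).abs().max().item())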


yes, i agree that low-level & infra work is where a lot of deepseek's improvement came from


There is a big difference between inventing a technique and productising it.


One issue is that a lot of techniques proposed (especially from academic research) are hard to validate at scale given the resources required. At least DeepSeek helps a little in that regard.



