
none of these techniques except MLA are new


They're not new in the same way Attention wasn't new when the transformer paper was written.

No one (publicly) had really pushed any of these techniques far, especially not for such a big run.


no one publicly pushes any techniques very far except for meta, and it's true they continue to train dense models for whatever reason.

the transformer was an entirely new architecture, a very different step change from this

e: and alibaba


They likely continue to train dense models because they are far easier to fine-tune, and this is a huge use case for the Llama models.


It probably also has to do with their internal infra. If it were just about dense models being easier for the OSS community to use & build on, they should probably be training MoEs and then distilling to dense.


There's new stuff in the lower layers. Some of the math is, well, interesting: a novel method of scaling mantissas and exponents. Yes, some of the operations have to use higher precision. Yes, some values like optimizer states, gradients and weights still need higher precision. But what they can do in 8 bits, they do in 8 bits. Of course, like everyone, they're reduced to begging NVIDIA to quantize on the global-to-shared memory transfer in order to realize the true potential of what they're trying to do. But hey, that's where we all are, and most papers I read don't have nearly as many interesting and novel techniques in them.
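
To make the fp8 part concrete, here's a rough sketch of block-wise scaling in PyTorch. The block size (128) and the e4m3 format are my assumptions for illustration, not a claim about DeepSeek's exact recipe; the point is just that each block gets its own scale so values fit fp8's narrow range, while accumulation and the master copies stay in higher precision.

    # Rough sketch of block-wise fp8 quantization (hypothetical 128-wide blocks,
    # e4m3 format). Each block of a row gets its own scale so the largest value
    # maps to fp8's max; dequantization multiplies the scale back in fp32.
    import torch

    FP8_MAX = 448.0  # largest normal value in torch.float8_e4m3fn

    def quantize_blockwise(x: torch.Tensor, block: int = 128):
        """Quantize a 2-D tensor to fp8 with one scale per block of each row."""
        rows, cols = x.shape  # cols must be divisible by `block` in this sketch
        x = x.reshape(rows, cols // block, block)
        scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
        q = (x / scale).to(torch.float8_e4m3fn)
        return q.reshape(rows, cols), scale.squeeze(-1)

    def dequantize_blockwise(q: torch.Tensor, scale: torch.Tensor, block: int = 128):
        """Undo the block-wise scaling, returning fp32."""
        rows, cols = q.shape
        x = q.to(torch.float32).reshape(rows, cols // block, block)
        return (x * scale.unsqueeze(-1)).reshape(rows, cols)

    w = torch.randn(256, 256)
    qw, sw = quantize_blockwise(w)
    print("max abs error:", (w - dequantize_blockwise(qw, sw)).abs().max().item())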

I think recomputing MLA and RMSNorm during backprop is something few would have done.
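
For context, that's a pure memory-for-compute trade: the outputs of cheap ops aren't cached for the backward pass, they're just rerun. A minimal sketch using the generic torch.utils.checkpoint mechanism (not DeepSeek's custom kernels):

    # Minimal sketch of activation recomputation for a cheap op like RMSNorm:
    # the checkpointed call stores only its inputs and reruns the forward during
    # backprop instead of caching intermediate activations.
    import torch
    from torch.utils.checkpoint import checkpoint

    class RMSNorm(torch.nn.Module):
        def __init__(self, dim: int, eps: float = 1e-6):
            super().__init__()
            self.weight = torch.nn.Parameter(torch.ones(dim))
            self.eps = eps

        def forward(self, x):
            rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
            return x * rms * self.weight

    norm = RMSNorm(1024)
    x = torch.randn(8, 1024, requires_grad=True)

    y_cached = norm(x)                                        # keeps activations around
    y_recomputed = checkpoint(norm, x, use_reentrant=False)   # reruns norm in backward
    y_recomputed.sum().backward()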

Dispensing with tensor parallelism by kind of overlapping forward and backprop. That would not have been intuitive to me. (I do, however, make room for the possibility that I'm just not terribly good at this anymore.)
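
For a sense of what interleaving forward and backward work looks like at the scheduling level, here's a toy 1F1B-style schedule in plain Python. This is the textbook pipeline baseline, not DualPipe itself (which additionally overlaps computation with communication), but it shows the flavor: a stage alternates forwards and backwards across micro-batches instead of finishing one full pass before starting the other.

    # Toy 1F1B pipeline schedule (illustration only, not DeepSeek's DualPipe):
    # each stage does a few warm-up forwards, then alternates one forward with
    # one backward per micro-batch, interleaving the two passes.
    def one_f_one_b(stage: int, num_stages: int, num_microbatches: int):
        """Return the ordered list of ops (F = forward, B = backward) a stage runs."""
        warmup = num_stages - stage - 1
        ops, fwd, bwd = [], 0, 0
        for _ in range(min(warmup, num_microbatches)):
            ops.append(f"F{fwd}"); fwd += 1
        while bwd < num_microbatches:
            if fwd < num_microbatches:
                ops.append(f"F{fwd}"); fwd += 1
            ops.append(f"B{bwd}"); bwd += 1
        return ops

    for s in range(4):
        print(f"stage {s}:", " ".join(one_f_one_b(s, 4, 8)))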

I don't know; I just think there are a lot of new takes in there.


i think some are reading my comment as critical of deepseek, but i'm more trying to say it is an infrastructural/engineering feat more so than an architectural innovation. this article doesn't even mention fp8. these have been by far the most interesting technical reports i've read in a while.


Flash attention was also built from techniques that were already common in other areas of optimized software, yet the big players weren't doing those optimizations when it came out, and it significantly improved everything.
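
For reference, the core of those borrowed techniques is tiling plus an online softmax, so the full attention matrix is never materialized. A naive single-query sketch in plain PyTorch (for clarity, not the fused kernel):

    # Naive sketch of flash attention's core trick for one query vector:
    # process keys/values in tiles while keeping a running max and running
    # softmax denominator, so the full score matrix never exists in memory.
    import torch

    def flash_attention_one_query(q, K, V, tile: int = 64):
        d = q.shape[-1]
        m = torch.tensor(float("-inf"))   # running max of the logits
        l = torch.tensor(0.0)             # running softmax denominator
        acc = torch.zeros_like(V[0])      # running weighted sum of values
        for start in range(0, K.shape[0], tile):
            k_tile, v_tile = K[start:start + tile], V[start:start + tile]
            scores = (k_tile @ q) / d**0.5
            m_new = torch.maximum(m, scores.max())
            correction = torch.exp(m - m_new)  # rescale old state to the new max
            p = torch.exp(scores - m_new)
            l = l * correction + p.sum()
            acc = acc * correction + p @ v_tile
            m = m_new
        return acc / l

    q = torch.randn(64)
    K, V = torch.randn(1024, 64), torch.randn(1024, 64)
    out = flash_attention_one_query(q, K, V)
    ref = torch.softmax((K @ q) / 64**0.5, dim=0) @ V
    print("max diff vs. reference:", (out - ref).abs().max().item())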


yes, i agree that low-level & infra work is where a lot of deepseek's improvement came from


There is a big difference between inventing a technique and productising it.


One issue is that a lot of techniques proposed (especially from academic research) are hard to validate at scale given the resources required. At least DeepSeek helps a little in that regard.



