There's new stuff in the lower layers. Some of the math is genuinely interesting: a novel method of scaling mantissas and exponents. Yes, some of the operations have to use higher precision. Yes, some values like optimizer states, gradients and weights still need higher precision. But what they can do in 8 bits they do in 8 bits. Of course, like everyone, they're reduced to begging NVIDIA to quantize on the global-to-shared memory transfer in order to realize the true potential of what they're trying to do. But hey, that's where we all are, and most papers I read don't have nearly as many interesting and novel techniques in them.
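
For flavor, here's a rough sketch of the tile-wise scaling idea. This is my own toy version, not their kernel code: the names (TILE, quantize_tiles) and the tile width are just my illustration, and I only model the scaling and clipping, not the actual 8-bit rounding.

    import numpy as np

    E4M3_MAX = 448.0   # largest finite value in the float8 e4m3 format
    TILE = 128         # hypothetical tile width; pick whatever the kernel uses

    def quantize_tiles(x: np.ndarray):
        """Scale each tile so its max magnitude lands at the top of the FP8 range."""
        tiles = x.reshape(-1, TILE)
        amax = np.abs(tiles).max(axis=1, keepdims=True)
        scale = np.where(amax > 0, E4M3_MAX / amax, 1.0)
        # A real kernel would now cast to e4m3; here we only clip to the
        # representable range, so the 8-bit rounding error isn't modeled.
        q = np.clip(tiles * scale, -E4M3_MAX, E4M3_MAX)
        return q, scale

    def dequantize_tiles(q, scale, shape):
        return (q / scale).reshape(shape)

    x = np.random.randn(4, 256).astype(np.float32) * 3.0
    q, s = quantize_tiles(x)
    print(np.abs(x - dequantize_tiles(q, s, x.shape)).max())  # ~0 without rounding

The point is just that the scale travels with the tile, so the 8 bits get spent on each tile's local dynamic range, while the master weights, gradients, and optimizer states stay in higher precision.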

I think recomputing the MLA and RMSNorm activations during backprop, rather than caching them, is something few would have done.
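
The general trade, for anyone who hasn't hit it before: throw the activation away after the forward pass and redo the (cheap) computation when backprop needs it. A minimal PyTorch sketch of the idea, not their implementation, with a Linear standing in for the MLA up-projection:

    import torch
    from torch.utils.checkpoint import checkpoint

    class RMSNorm(torch.nn.Module):
        def __init__(self, dim, eps=1e-6):
            super().__init__()
            self.weight = torch.nn.Parameter(torch.ones(dim))
            self.eps = eps
        def forward(self, x):
            rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
            return x * rms * self.weight

    norm = RMSNorm(512)
    up_proj = torch.nn.Linear(512, 2048)   # stand-in for an MLA up-projection

    def block(x):
        return up_proj(norm(x))

    x = torch.randn(8, 512, requires_grad=True)
    # With checkpointing, the activations inside `block` are not stored; they are
    # recomputed from `x` during backward, trading a little compute for memory.
    y = checkpoint(block, x, use_reentrant=False)
    y.sum().backward()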

Dispensing with tensor parallelism by, in effect, overlapping the forward and backward passes of different micro-batches. That would not have been intuitive to me. (I do, however, make room for the possibility that I'm just not terribly good at this anymore.)
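
To see why the overlap helps, here's a toy schedule for a single pipeline stage. This is my own simplified one-forward-one-backward pairing, not their actual schedule: in the steady state each step has a forward chunk of one micro-batch and a backward chunk of an earlier one in flight, so the communication of one can hide behind the compute of the other.

    def stage_schedule(num_microbatches: int, warmup: int = 2):
        """Toy (forward_mb, backward_mb) pairs for one pipeline stage."""
        steps, fwd, bwd = [], 0, 0
        while fwd < warmup:                      # warm-up: forwards only
            steps.append((fwd, None)); fwd += 1
        while fwd < num_microbatches:            # steady state: one of each per step
            steps.append((fwd, bwd)); fwd += 1; bwd += 1
        while bwd < num_microbatches:            # cool-down: drain the backwards
            steps.append((None, bwd)); bwd += 1
        return steps

    for f, b in stage_schedule(6):
        left = f"fwd mb{f}" if f is not None else "       "
        right = f"bwd mb{b}" if b is not None else ""
        print(left, "|", right)

As I read it, the real schedule further splits those paired chunks so the expert all-to-all of one overlaps the matmuls of the other, which is what lets them skip tensor parallelism.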

I don't know; I just think there are a lot of new takes in there.



I think some are reading my comment as critical of DeepSeek, but I'm more trying to say it is an infrastructural/engineering feat more so than an architectural innovation. This article doesn't even mention FP8. These have been by far the most interesting technical reports I've read in a while.



