
I think they're correct; do you have a source? As far as I know, the only other components are the fully connected networks, which are not big contributors.



It's quadratic because of the dot products in the attention mechanism.

You can use KV caching to get rid of much of the runtime that comes from redundant matrix multiplications, but after you have cached everything, you still need to calculate the dot product k_i * q_j, with i and j being the token indices. With n tokens, you get O(n^2).

But you have to remember that this is only n^2 dot products. It's not exactly the end of the world at context sizes of 32k, for example. It only gets nasty in the hundreds of thousands to millions.
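
A minimal sketch of that point (my own illustration, not from the thread, with a made-up head dimension and single-head attention): even with every key cached, each newly generated token is still scored against all cached keys, so a full sequence costs 1 + 2 + ... + n = O(n^2) dot products.

  import torch

  d = 64                       # head dimension (illustrative)
  n = 32_000                   # tokens already in the context
  k_cache = torch.randn(n, d)  # cached keys for the previous tokens
  q_new = torch.randn(d)       # query of the newly generated token

  # One new token still needs a dot product against every cached key:
  scores = k_cache @ q_new                        # n dot products -> O(n) per step
  attn = torch.softmax(scores / d ** 0.5, dim=-1)
  # Over n generation steps this sums to 1 + 2 + ... + n, i.e. O(n^2).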

Here is the source I used: https://sebastianraschka.com/blog/2023/self-attention-from-s...


https://news.ycombinator.com/item?id=39475528

My source is me :) I work at PyTorch on ML compilers.

If you don't believe me, perhaps you'll believe Karpathy's diagram (and the general discussion in the thread): https://twitter.com/karpathy/status/1658161721251602432


For small values of N, the linear terms of the transformer dominate. At the end of the day, a double layer of 768x2048 is still north of 3.1 MM flops/token/layer.
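
A rough back-of-the-envelope comparison (my own numbers, counting one multiply-add as one flop and ignoring the attention projections) of that linear MLP term against the quadratic attention term, per token:

  d_model, d_ff = 768, 2048
  mlp = 2 * d_model * d_ff        # up- and down-projection: ~3.1M multiply-adds/token/layer
  for n in (2_000, 32_000, 1_000_000):
      attn = 2 * n * d_model      # scores plus weighted sum over n cached tokens
      print(f"n={n:>9,}  mlp={mlp:,}  attention={attn:,}")

Under these assumptions the two terms cross over at around a couple of thousand tokens of context; below that the linear layers dominate, and well above it the attention term takes over.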



