
I think they're correct; do you have a source? As far as I know, the only other components are the fully connected networks, which are not big contributors.



It's quadratic because of the dot products in the attention mechanism.

You can use KV caching to get rid of much of the runtime that comes from redundant matrix multiplications, but after you have cached everything, you still need to calculate the dot product k_i * q_j, with i and j being the token indices. With n tokens, you get O(n^2).

But you have to remember that this is only n^2 dot products. It's not exactly the end of the world at context sizes of 32k, for example. It only gets nasty in the hundreds of thousands to millions.
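
A minimal sketch of that point (my own illustration, not from the thread, with a made-up head dimension and single-head attention): even with every key cached, each newly generated token is still scored against all cached keys, so a full sequence costs 1 + 2 + ... + n = O(n^2) dot products.

  import torch

  d = 64                       # head dimension (illustrative)
  n = 32_000                   # tokens already in the context
  k_cache = torch.randn(n, d)  # cached keys for the previous tokens
  q_new = torch.randn(d)       # query of the newly generated token

  # One new token still needs a dot product against every cached key:
  scores = k_cache @ q_new                        # n dot products -> O(n) per step
  attn = torch.softmax(scores / d ** 0.5, dim=-1)
  # Over n generation steps this sums to 1 + 2 + ... + n, i.e. O(n^2).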

Here is the source I used: https://sebastianraschka.com/blog/2023/self-attention-from-s...


https://news.ycombinator.com/item?id=39475528

My source is me :) I work at PyTorch on ML compilers.

If you don't believe me, perhaps you'll believe Karpathy's diagram (and the general discussion in the thread): https://twitter.com/karpathy/status/1658161721251602432


For small values of N, the linear terms of the transformer dominate. At the end of the day, a double layer of 768x2048 is still north of 3.1 MM flops/token/layer.
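
A rough back-of-the-envelope comparison (my own numbers, counting one multiply-add as one flop and ignoring the attention projections) of that linear MLP term against the quadratic attention term, per token:

  d_model, d_ff = 768, 2048
  mlp = 2 * d_model * d_ff        # up- and down-projection: ~3.1M multiply-adds/token/layer
  for n in (2_000, 32_000, 1_000_000):
      attn = 2 * n * d_model      # scores plus weighted sum over n cached tokens
      print(f"n={n:>9,}  mlp={mlp:,}  attention={attn:,}")

Under these assumptions the two terms cross over at around a couple of thousand tokens of context; below that the linear layers dominate, and well above it the attention term takes over.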



