
It's quadratic because of the dot products in the attention mechanism.

You can use KV caching to get rid of a lot of the redundant matrix multiplications, but even after you have cached everything, you still need to compute the dot product k_i * q_j, with i, j being the token indices. With n tokens, that's O(n^2).
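A minimal sketch of that point in NumPy (shapes and names are illustrative, not from the source): even with a key cache, each newly generated token's query has to be dotted against every cached key, so generating n tokens costs 1 + 2 + ... + n, i.e. O(n^2) dot products in total.

  import numpy as np

  d = 64          # head dimension (assumed)
  n = 8           # number of tokens to generate (assumed)
  rng = np.random.default_rng(0)

  k_cache = []                      # cached keys, one per previous token
  total_dot_products = 0

  for i in range(n):
      q_i = rng.standard_normal(d)  # query for the new token
      k_i = rng.standard_normal(d)  # key for the new token
      k_cache.append(k_i)

      # Attention scores: q_i dotted with every key seen so far (i + 1 of them).
      scores = np.array([q_i @ k_j for k_j in k_cache])
      total_dot_products += len(k_cache)

  print(total_dot_products)         # n * (n + 1) / 2, i.e. O(n^2)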

But you have to remember that this is only n^2 dot products. It's not exactly the end of the world at a context size of 32k, for example. It only gets nasty in the hundreds of thousands to millions.
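A rough scale check (the head dimension of 128 is an assumption, just to turn dot products into multiply-adds):

  for n in (32_768, 131_072, 1_048_576):
      dot_products = n * n
      multiply_adds = dot_products * 128   # each dot product is d multiply-adds
      print(f"n={n:>9,}: {dot_products:.2e} dot products, "
            f"{multiply_adds:.2e} multiply-adds per head")

At 32k context that's on the order of 1e9 dot products per head, which modern hardware handles fine; at 1M context it's roughly a thousand times more, which is where the quadratic term starts to dominate.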

Here is the source I used: https://sebastianraschka.com/blog/2023/self-attention-from-s...


