I think they are correct; do you have a source? To my knowledge the only other components are the fully connected networks, which are not big contributors.
It's quadratic because of the dot products in the attention mechanism.
You can use K-V caching to get rid of a lot of the redundant matrix multiplications during decoding, but even after everything is cached you still need to compute the dot product k_i * q_j, where i and j are token indices. With n tokens, that gives you O(n^2).
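To make that concrete, here is a minimal NumPy sketch of autoregressive decoding with a K-V cache (toy dimensions, not any particular framework's API): keys and values are reused from the cache, but step j still does j+1 dot products, so the whole sequence does on the order of n^2 of them.

```python
# Minimal sketch: decoding with a K-V cache. Caching avoids recomputing K and V
# for earlier tokens, but step j still takes j+1 dot products q_j . k_i, so the
# whole sequence costs 1 + 2 + ... + n = n(n+1)/2 = O(n^2) score computations.
import numpy as np

d = 64                      # head dimension (illustrative)
n = 128                     # number of tokens to decode
k_cache, v_cache = [], []   # the K-V cache, one entry per past token
total_scores = 0

rng = np.random.default_rng(0)
for j in range(n):
    q_j = rng.standard_normal(d)            # query for the new token
    k_cache.append(rng.standard_normal(d))  # its key and value go into the cache
    v_cache.append(rng.standard_normal(d))

    K = np.stack(k_cache)              # (j+1, d) -- reused, not recomputed
    scores = K @ q_j / np.sqrt(d)      # j+1 dot products at this step
    total_scores += scores.size

    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out = weights @ np.stack(v_cache)  # attention output for token j

print(total_scores, n * (n + 1) // 2)  # both ~n^2/2 -> O(n^2)
```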
But you have to remember that this is only n^2 dot products. It's not exactly the end of the world at a context size of 32k, for example. It only gets nasty in the hundreds of thousands to millions.
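Rough numbers for the score matrix alone, just to put those context sizes side by side (causal masking would roughly halve these counts, but the scaling is the same):

```python
# Number of attention score entries (q_i . k_j pairs) over the full sequence.
for n in (4_096, 32_768, 131_072, 1_000_000):
    print(f"n = {n:>9,}: n^2 = {n * n:.2e} dot products")
# 32k^2 is on the order of 1e9; 1M^2 is on the order of 1e12.
```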
For small values of n, the linear terms of the transformer dominate. At the end of the day, a double layer of 768*2048 is still north of 3.1M FLOPs per token per layer.
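A rough back-of-the-envelope comparison of where the quadratic attention term catches up with that linear term (the 768*2048 double layer is the one from the comment above, and 3.1M is 2 x 768 x 2048 multiply-accumulates; the attention cost here counts only the QK^T scores and the weighted sum over V, ignoring the QKV/output projections, which is a simplification):

```python
# Per-token, per-layer multiply-accumulate counts, assuming d_model = 768 and a
# 768 -> 2048 -> 768 feed-forward block (the "double layer" above).
d_model, d_ff = 768, 2048
ffn_macs = 2 * d_model * d_ff            # ~3.1M, independent of context length

for n in (512, 2_048, 32_768, 131_072):
    # QK^T scores plus the weighted sum over V, for one token attending to n positions
    attn_macs = 2 * n * d_model
    print(f"n = {n:>7,}: FFN {ffn_macs/1e6:5.1f}M  vs  attention {attn_macs/1e6:7.1f}M MACs")
# The quadratic term only overtakes the linear one once n reaches a few thousand.
```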