
Clarifying:

Given the question: "How much is the flash attention algorithm tied to the hardware?"

The answer is 0.

E.g., you can find generic flash attention recently added in llama.cpp and ONNX (Microsoft needed it for Phi-3, which was needed for Recall).
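To make the hardware-independence point concrete, here is a minimal sketch of the tiled forward pass in plain NumPy (single head, tiling only over the keys/values, arbitrary block size); it's an illustration of the math, not any particular kernel:

    import numpy as np

    def flash_attention_forward(Q, K, V, block_size=64):
        # Q, K, V: (seq_len, d) arrays for a single head.
        seq_len, d = Q.shape
        scale = 1.0 / np.sqrt(d)
        O = np.zeros_like(Q)
        m = np.full(seq_len, -np.inf)   # running row-wise max of the logits
        l = np.zeros(seq_len)           # running row-wise sum of exp(logit - m)
        for start in range(0, seq_len, block_size):
            Kb = K[start:start + block_size]   # one tile of keys/values
            Vb = V[start:start + block_size]
            S = (Q @ Kb.T) * scale             # logits for this tile
            m_new = np.maximum(m, S.max(axis=1))
            P = np.exp(S - m_new[:, None])     # tile weights at the new max
            c = np.exp(m - m_new)              # rescale the old accumulators
            l = l * c + P.sum(axis=1)
            O = O * c[:, None] + P @ Vb
            m = m_new
        return O / l[:, None]

    # agrees with the naive softmax(Q K^T / sqrt(d)) V
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
    S = (Q @ K.T) / np.sqrt(32)
    P = np.exp(S - S.max(axis=1, keepdims=True))
    ref = (P / P.sum(axis=1, keepdims=True)) @ V
    assert np.allclose(flash_attention_forward(Q, K, V), ref)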

As an aside, on novelty I have no direct knowledge, but IMHO asking that question would devolve the way novelty arguments do in any field: there's always someone else who can claim they did 80% of $X via $X-1, therefore $X is by and large not novel. Ad infinitum.




I think the right analogy for FA is high-quality cache-aware BLAS kernel implementations. The algorithm(s) is (are) clever and (as you note) completely independent of hardware. However, a hardware-naive implementation is approximately worthless. Most of the value of MKL, or Accelerate, or FA is in the careful matching of the parameters and implementation of the algorithm to the capabilities of the hardware it's going to run on.

I definitely don't mean to take away from Tri/FA by mentioning novelty - I'm just repeating from the paper, which refers back to algebraic aggregates[0] in its discussion of the tiled softmax.

[0]: https://web.stanford.edu/class/cs345d-01/rl/olap.pdf
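For what it's worth, the algebraic-aggregate connection fits in a few lines: the per-block (max, rescaled sum) summaries merge associatively, so the softmax normalizer can be accumulated block by block in any order. A rough sketch (names are mine, not from the paper):

    import numpy as np

    def block_summary(x):
        m = x.max()
        return m, np.exp(x - m).sum()   # (block max, sum rescaled to that max)

    def merge(a, b):
        (ma, la), (mb, lb) = a, b
        m = max(ma, mb)                 # associative, order-independent combine
        return m, la * np.exp(ma - m) + lb * np.exp(mb - m)

    x = np.random.default_rng(1).standard_normal(1000)
    m, l = merge(block_summary(x[:400]), block_summary(x[400:]))
    assert np.isclose(l * np.exp(m), np.exp(x).sum())   # same normalizer as one pass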


> However, a hardware-naive implementation is approximately worthless.

This isn’t true when one vendor has 90% of the market and there are only two, maybe three, generations of hardware to consider. Support the A100 and H100 and you are supporting most of the current market.


Supporting the A100 and H100 is the opposite of being hardware-naive, though.



