
The original FlashAttention (v1?) took about a year to get added to llama.cpp, and it only provides single-digit-percent VRAM savings for typical context lengths and practically no speed boost. Still nice to have, but man was this thing overhyped. I doubt v3 will do more than marginally better on the RTX 5000 series.
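A rough back-of-the-envelope sketch of why the savings look modest at typical context lengths (my own numbers and model shape, not anything from llama.cpp itself): the KV cache, which FlashAttention does not shrink, grows linearly with context, while the attention-score buffer that FlashAttention avoids materializing scales with context times the prompt-processing batch, and the model weights dominate either way.

    # Hedged estimate only: a hypothetical 7B-class transformer
    # (32 layers, 32 KV heads, head_dim 128), fp16 KV cache, fp32 scores.
    def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elt=2):
        """K + V tensors kept for every past token in every layer (unaffected by FA)."""
        return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elt

    def score_buffer_bytes(n_ctx, n_batch=512, n_heads=32, bytes_per_elt=4):
        """One layer's attention-score buffer at prompt batch n_batch;
        roughly what flash attention avoids allocating."""
        return n_heads * n_batch * n_ctx * bytes_per_elt

    for n_ctx in (4096, 32768):
        kv = kv_cache_bytes(n_ctx) / 2**30
        sc = score_buffer_bytes(n_ctx) / 2**30
        print(f"n_ctx={n_ctx:6d}  KV cache ~ {kv:5.2f} GiB   score buffer ~ {sc:5.2f} GiB")

At 4k context this gives roughly 2 GiB of KV cache versus about 0.25 GiB of score buffer, next to ~13 GiB of fp16 weights, i.e. single-digit-percent savings; only at much longer contexts (or larger batches) does the avoided buffer become a big fraction of total VRAM.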



On GPU, or on CPU/Metal? For the latter I'm not surprised, but that's because they have a totally different memory/cache hierarchy.


With CUDA offloading; I don't think it runs at all without it.



