
The attention implementation is such that memory scales quadratically with sequence length. At typical sequence lengths this is still small compared to the model weights themselves, but past some sequence length it starts to dominate.
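
A rough back-of-the-envelope sketch of the crossover point, assuming hypothetical 7B-scale dimensions (32 heads, fp16, batch size 1); standard attention materializes an (seq_len x seq_len) score matrix per head, so even a single layer's score matrices eventually outweigh the weights:

    BYTES_FP16 = 2

    def naive_score_matrix_bytes(seq_len: int, n_heads: int = 32) -> int:
        # one layer's attention score matrices: n_heads * seq_len * seq_len elements
        return n_heads * seq_len * seq_len * BYTES_FP16

    def weights_bytes(n_params: float = 7e9) -> float:
        # model parameters stored in fp16
        return n_params * BYTES_FP16

    for seq_len in (2_048, 8_192, 32_768):
        scores_gb = naive_score_matrix_bytes(seq_len) / 1e9
        print(f"seq_len={seq_len:>6}: per-layer score matrices ~{scores_gb:6.1f} GB "
              f"vs weights ~{weights_bytes() / 1e9:.0f} GB")

At 2k tokens the score matrices are a rounding error next to ~14 GB of weights; at 32k tokens they are several times larger than the weights for a single layer.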

By using flash attention, you can get the memory requirement down to scaling linearly with sequence length.
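
A minimal sketch of how you'd get that in practice (PyTorch, assuming a CUDA GPU with fp16 support): the fused scaled_dot_product_attention kernel can dispatch to a FlashAttention-style implementation, so the O(seq_len^2) score matrix is never materialized and extra memory stays roughly linear in seq_len.

    import torch
    import torch.nn.functional as F

    batch, n_heads, seq_len, head_dim = 1, 32, 32_768, 128
    q = torch.randn(batch, n_heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    # A naive implementation would allocate q @ k.transpose(-2, -1):
    # 32 * 32768^2 fp16 values (~69 GB). The fused kernel computes the
    # softmax in tiles instead of storing the full score matrix.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(out.shape)  # (1, 32, 32768, 128)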


