The naive attention implementation materializes the full attention score matrix, so its memory scales quadratically with sequence length. Overall this is still a small factor compared to the model weights, but at long sequence lengths it starts to dominate.
By using FlashAttention, the memory requirement of attention scales linearly with sequence length instead.
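As a rough sketch of the difference, the snippet below (illustrative only; the batch size, head count, sequence length, and head dimension are assumed values, not from the text) compares the size of the quadratic score matrix that naive attention allocates against the linear-sized Q/K/V/output tensors that a FlashAttention-style kernel actually keeps in memory. In PyTorch, `torch.nn.functional.scaled_dot_product_attention` can dispatch to such a kernel on supported GPUs, so the quadratic buffer is never written out.

```python
import torch
import torch.nn.functional as F

# Assumed, illustrative shapes (not from the original text).
batch, heads, seq_len, head_dim = 1, 32, 32_768, 128
bytes_per_el = 2  # fp16/bf16

# Naive attention materializes a (batch, heads, S, S) score matrix: O(S^2).
naive_scores_gb = batch * heads * seq_len**2 * bytes_per_el / 1e9
print(f"naive score matrix: {naive_scores_gb:.1f} GB")  # ~68.7 GB at S=32k

# FlashAttention keeps the score tiles on-chip; only Q, K, V and the output
# live in HBM, i.e. O(S * head_dim) per head.
qkvo_gb = 4 * batch * heads * seq_len * head_dim * bytes_per_el / 1e9
print(f"Q/K/V/output tensors: {qkvo_gb:.2f} GB")  # ~1.07 GB at S=32k

# scaled_dot_product_attention can use a FlashAttention backend on supported
# hardware; a tiny sequence length is used here just so this runs on CPU.
q = torch.randn(batch, heads, 128, head_dim)
out = F.scaled_dot_product_attention(q, q, q)
```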