Looks like you need about 120 GB to fine-tune the 65B model with this code at a sequence length of 512. How does the memory usage scale as the sequence length grows?
The attention implementation is such that memory scales quadratically with sequence length. At a sequence length of 512 this is still a small contribution compared to the model weights themselves, but beyond a certain sequence length the quadratic term would dominate.
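A rough back-of-the-envelope estimate of just the attention score matrices illustrates the quadratic growth. This is a sketch, not tied to this repo's exact implementation: the layer/head counts below are the standard 65B configuration (80 layers, 64 heads), and it pessimistically assumes one fp16 score matrix of shape (batch, heads, seq_len, seq_len) is kept per layer for the backward pass (activation checkpointing would reduce this).

```python
def attention_score_memory_gb(seq_len, n_layers=80, n_heads=64, batch=1, bytes_per_elem=2):
    # One (seq_len x seq_len) score matrix per head, per layer, in fp16.
    per_layer = batch * n_heads * seq_len * seq_len * bytes_per_elem
    return n_layers * per_layer / 1024**3

for s in (512, 2048, 8192):
    print(f"seq_len={s:5d}: ~{attention_score_memory_gb(s):.1f} GB for score matrices")
```

Under these assumptions that is roughly 2.5 GB at 512 (small next to ~120 GB of weights and optimizer state), ~40 GB at 2048, and ~640 GB at 8192, which is where the quadratic term takes over.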
By using FlashAttention, you can bring the memory requirement down so that it scales linearly with sequence length.
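As a minimal sketch (not this repo's code) of what that swap looks like, PyTorch's `scaled_dot_product_attention` can dispatch to a FlashAttention kernel on supported GPUs and computes the softmax block-wise, so the full (seq_len x seq_len) score matrix is never materialized:

```python
import torch
import torch.nn.functional as F

batch, n_heads, seq_len, head_dim = 1, 64, 2048, 128
q = torch.randn(batch, n_heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Naive attention: materializes a (seq_len, seq_len) score matrix per head -> O(seq_len^2) memory.
scores = (q @ k.transpose(-2, -1)) / head_dim**0.5
naive_out = torch.softmax(scores, dim=-1) @ v

# Fused attention: softmax computed in tiles, so extra memory is O(seq_len).
fused_out = F.scaled_dot_product_attention(q, k, v)

# The two paths should agree up to fp16 rounding.
print((naive_out - fused_out).abs().max())
```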