Looks like you need about 120 GB to fine-tune the 65B model with this code at a sequence length of 512. How does the memory usage scale as the sequence length grows?
The attention implementation is such that memory scales quadratically with sequence length. At a sequence length of 512 this is still a small contribution compared to the model weights themselves, but beyond a certain sequence length the quadratic term would dominate.
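A rough back-of-the-envelope estimate of just the attention score matrices illustrates the quadratic growth. This is a sketch, not tied to this repo's exact implementation: the layer/head counts below are the standard 65B configuration (80 layers, 64 heads), and it pessimistically assumes one fp16 score matrix of shape (batch, heads, seq_len, seq_len) is kept per layer for the backward pass (activation checkpointing would reduce this).

```python
def attention_score_memory_gb(seq_len, n_layers=80, n_heads=64, batch=1, bytes_per_elem=2):
    # One (seq_len x seq_len) score matrix per head, per layer, in fp16.
    per_layer = batch * n_heads * seq_len * seq_len * bytes_per_elem
    return n_layers * per_layer / 1024**3

for s in (512, 2048, 8192):
    print(f"seq_len={s:5d}: ~{attention_score_memory_gb(s):.1f} GB for score matrices")
```

Under these assumptions that is roughly 2.5 GB at 512 (small next to ~120 GB of weights and optimizer state), ~40 GB at 2048, and ~640 GB at 8192, which is where the quadratic term takes over.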
By using FlashAttention, you can bring the memory requirement down so that it scales linearly with sequence length.
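As a minimal sketch (not this repo's code) of what that swap looks like, PyTorch's `scaled_dot_product_attention` can dispatch to a FlashAttention kernel on supported GPUs and computes the softmax block-wise, so the full (seq_len x seq_len) score matrix is never materialized:

```python
import torch
import torch.nn.functional as F

batch, n_heads, seq_len, head_dim = 1, 64, 2048, 128
q = torch.randn(batch, n_heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Naive attention: materializes a (seq_len, seq_len) score matrix per head -> O(seq_len^2) memory.
scores = (q @ k.transpose(-2, -1)) / head_dim**0.5
naive_out = torch.softmax(scores, dim=-1) @ v

# Fused attention: softmax computed in tiles, so extra memory is O(seq_len).
fused_out = F.scaled_dot_product_attention(q, k, v)

# The two paths should agree up to fp16 rounding.
print((naive_out - fused_out).abs().max())
```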