
Looks like you need about 120 GB to fine-tune the 65B model with this code at a sequence length of 512. How does the memory usage scale as the sequence length grows?


The attention is implemented in a way where memory scales quadratically with sequence length. At 512 that's still a small factor compared to the model weights themselves, but at longer sequence lengths it would start to dominate.

By using flash attention, you can get the memory requirement down to scaling linearly with sequence length.
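To make the difference concrete, something like this compares peak memory of naive attention vs. the fused kernel (a rough sketch, assuming PyTorch 2.x on a CUDA GPU, where scaled_dot_product_attention can dispatch to a flash/memory-efficient kernel; the shapes are arbitrary and not tied to any particular model):

  import torch
  import torch.nn.functional as F

  # Toy shapes: batch 1, 32 heads, head dim 128.
  B, H, D = 1, 32, 128

  def attn_peak_mem(seq_len, use_flash):
      torch.cuda.reset_peak_memory_stats()
      q = torch.randn(B, H, seq_len, D, device="cuda", dtype=torch.float16)
      k = torch.randn_like(q)
      v = torch.randn_like(q)
      if use_flash:
          # Fused kernels avoid materializing the (S, S) score matrix.
          out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
      else:
          # Naive attention: the (S, S) score matrix is what grows quadratically.
          scores = (q @ k.transpose(-2, -1)) / D**0.5
          out = torch.softmax(scores, dim=-1) @ v
      torch.cuda.synchronize()
      return torch.cuda.max_memory_allocated() / 2**20  # MiB

  for s in (512, 1024, 2048, 4096):
      flash = attn_peak_mem(s, use_flash=True)
      naive = attn_peak_mem(s, use_flash=False)
      print(f"seq {s}: flash {flash:.0f} MiB, naive {naive:.0f} MiB")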


Lots of VRAM, but the method is the same. Here's someone who fine-tuned llama-30b on the alpaca dataset, for example: https://github.com/deep-diver/Alpaca-LoRA-Serve
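For reference, the LoRA part itself is only a few lines with the peft library (a sketch, not the exact config that repo uses; the checkpoint name and hyperparameters are just examples, and 8-bit loading assumes bitsandbytes is installed):

  from transformers import AutoModelForCausalLM, AutoTokenizer
  from peft import LoraConfig, get_peft_model

  base = "huggyllama/llama-30b"  # example checkpoint name, swap in whatever you have locally
  model = AutoModelForCausalLM.from_pretrained(base, load_in_8bit=True, device_map="auto")
  tokenizer = AutoTokenizer.from_pretrained(base)

  # LoRA: freeze the base weights and train small low-rank adapters on the attention projections.
  config = LoraConfig(
      r=8, lora_alpha=16, lora_dropout=0.05,
      target_modules=["q_proj", "v_proj"],
      task_type="CAUSAL_LM",
  )
  model = get_peft_model(model, config)
  model.print_trainable_parameters()  # typically well under 1% of the base model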


> How does the memory usage scale as the sequence length grows?

That's a good question. I was under the assumption that it's linear in sequence length, but I can test it out, I guess.
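Something like this would do for a quick test (a sketch; it uses a small stand-in model from the hub rather than llama, just to show the measurement):

  import torch
  from transformers import AutoModelForCausalLM

  model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()  # stand-in model, not llama

  def peak_mem_for_seq_len(seq_len):
      torch.cuda.empty_cache()
      torch.cuda.reset_peak_memory_stats()
      ids = torch.randint(0, model.config.vocab_size, (1, seq_len), device="cuda")
      out = model(input_ids=ids, labels=ids)   # forward pass + loss
      out.loss.backward()                      # backward pass, so activation memory counts too
      model.zero_grad(set_to_none=True)
      return torch.cuda.max_memory_allocated() / 2**20  # MiB

  for s in (128, 256, 512, 1024):
      print(s, f"{peak_mem_for_seq_len(s):.0f} MiB")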


I suspect it's linear with a small constant factor.


man i hope there are online calculators that let people visualize the costs of training these things.
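In the meantime a back-of-envelope version is easy to script yourself (very rough numbers; assumes fp16 weights, and full fine-tuning with Adam for the second figure, ignoring activations and framework overhead entirely):

  def weights_gb(n_params_billion, bytes_per_param=2):
      """Memory just to hold the weights (2 bytes/param for fp16/bf16, 1 for int8, 0.5 for 4-bit)."""
      return n_params_billion * bytes_per_param

  def full_finetune_gb(n_params_billion):
      """Very rough full-finetune floor with Adam: fp16 weights + fp16 grads + fp32 m and v states."""
      return n_params_billion * (2 + 2 + 4 + 4)

  for size in (7, 13, 30, 65):
      print(f"{size}B: weights ~{weights_gb(size):.0f} GB fp16, "
            f"full finetune ~{full_finetune_gb(size):.0f} GB (+ activations)")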


Really takes that much VRAM??


Normally people split up the model across multiple GPUs, i.e. model/tensor parallelism.
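The low-effort way with transformers + accelerate is to let device_map spread the layers across whatever GPUs are visible (a sketch; this is naive layer-wise splitting rather than true tensor parallelism, and the checkpoint name is just an example):

  from transformers import AutoModelForCausalLM

  # device_map="auto" shards the layers across all visible GPUs (and CPU if needed),
  # so a model that doesn't fit on one card can still load and run.
  model = AutoModelForCausalLM.from_pretrained(
      "huggyllama/llama-65b",            # example checkpoint name
      device_map="auto",
      torch_dtype="auto",
  )
  print(model.hf_device_map)  # shows which layer landed on which device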



