
You can probably finetune a 13b one with that. Try these scripts: https://github.com/zphang/minimal-llama/#minimal-llama
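If it helps, here is a rough sketch of what a LoRA-style finetune of a 13B checkpoint looks like with Transformers + PEFT; the checkpoint name and LoRA hyperparameters are placeholders, and the linked minimal-llama scripts may wire things up differently:

  # Sketch of 8-bit LoRA finetuning for a 13B LLaMA-style model.
  # Checkpoint name and LoRA settings are illustrative placeholders.
  import torch
  from transformers import AutoTokenizer, AutoModelForCausalLM
  from peft import LoraConfig, get_peft_model

  model_name = "huggyllama/llama-13b"  # placeholder checkpoint
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(
      model_name,
      load_in_8bit=True,           # quantize so 13B fits on a single large GPU
      torch_dtype=torch.float16,
      device_map="auto",
  )

  # Train only small low-rank adapter matrices instead of all 13B weights.
  lora_config = LoraConfig(
      r=8, lora_alpha=16, lora_dropout=0.05,
      target_modules=["q_proj", "v_proj"],
      task_type="CAUSAL_LM",
  )
  model = get_peft_model(model, lora_config)
  model.print_trainable_parameters()  # only a fraction of a percent is trainable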


What if I want to finetune on long documents, say AI papers that are ~10 pages long on average? How would they be tokenized, given that max_seq_length is 512?


Split your training data into chunks of text that make sense. A random dataset example: https://huggingface.co/datasets/imdb
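For long papers, the usual trick is to tokenize the whole document and slice the token ids into max_seq_length-sized windows, optionally with some overlap so context isn't cut mid-thought. Rough sketch; the tokenizer name and overlap value are just illustrative:

  # Split a long document into max_seq_length-token chunks; each chunk
  # becomes one training sample. Tokenizer and overlap are placeholders.
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
  max_seq_length = 512

  def chunk_document(text, overlap=64):
      ids = tokenizer(text, add_special_tokens=False)["input_ids"]
      step = max_seq_length - overlap
      # Slide a fixed-size window over the token ids.
      return [ids[i:i + max_seq_length] for i in range(0, len(ids), step)]

  paper_text = open("paper.txt").read()  # a ~10-page paper
  samples = chunk_document(paper_text)   # list of 512-token chunks

A ~10-page paper is very roughly 5k-8k tokens, so you'd get on the order of 10-20 such chunks per paper.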


Thanks. What does "make sense" mean here? That each chunk is logically coherent (e.g. a paragraph of text in a document)?

And does the training then create windows of n-grams over those chunks? Or what is the input/output?

The reason I ask: if I had question/answer pairs, the question would be the input and the answer the output.

What is the "output" when the input is just a (logically coherent) chunk of text?


> What is the "output" when the input is just a (logically coherent) chunk of text?

It probably won't change much if it's just a single sample. If you put in a large corpus of samples that repeat on the same theme, then the model will be "tuned" to repeat that theme. If you increase the number of epochs, you can overtrain it, meaning that it will just spit out the training data text.
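Concretely: with a causal language model there is no separate "answer" column for plain text. The training objective is next-token prediction, so the "output" is effectively the same chunk shifted by one token. A minimal sketch with Transformers (model name is a placeholder):

  # For a causal LM, the labels are just the input ids; the model shifts
  # them internally, so the loss is next-token prediction over the chunk.
  from transformers import AutoTokenizer, AutoModelForCausalLM

  name = "huggyllama/llama-7b"  # placeholder
  tokenizer = AutoTokenizer.from_pretrained(name)
  model = AutoModelForCausalLM.from_pretrained(name)

  chunk = "Transformers use self-attention to mix information across tokens."
  enc = tokenizer(chunk, return_tensors="pt")
  out = model(input_ids=enc["input_ids"], labels=enc["input_ids"])
  print(out.loss)  # cross-entropy of predicting each token from its prefix

Question/answer pairs are usually handled the same way, just concatenated into one sequence (often with the loss masked out on the question part).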



