
You can probably finetune a 13b one with that. Try these scripts: https://github.com/zphang/minimal-llama/#minimal-llama
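If it helps, here is a rough sketch of what a LoRA-style finetune of a 13B checkpoint looks like with Transformers + PEFT; the checkpoint name and LoRA hyperparameters are placeholders, and the linked minimal-llama scripts may wire things up differently:

  # Sketch of 8-bit LoRA finetuning for a 13B LLaMA-style model.
  # Checkpoint name and LoRA settings are illustrative placeholders.
  import torch
  from transformers import AutoTokenizer, AutoModelForCausalLM
  from peft import LoraConfig, get_peft_model

  model_name = "huggyllama/llama-13b"  # placeholder checkpoint
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(
      model_name,
      load_in_8bit=True,           # quantize so 13B fits on a single large GPU
      torch_dtype=torch.float16,
      device_map="auto",
  )

  # Train only small low-rank adapter matrices instead of all 13B weights.
  lora_config = LoraConfig(
      r=8, lora_alpha=16, lora_dropout=0.05,
      target_modules=["q_proj", "v_proj"],
      task_type="CAUSAL_LM",
  )
  model = get_peft_model(model, lora_config)
  model.print_trainable_parameters()  # only a fraction of a percent is trainable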


What if I want to finetune on long documents, say AI papers that are ~10 pages long on average? How would they be tokenized, given that max_seq_length is 512?


Split your training data into chunks of text that make sense. A random dataset example: https://huggingface.co/datasets/imdb
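For long papers, the usual trick is to tokenize the whole document and slice the token ids into max_seq_length-sized windows, optionally with some overlap so context isn't cut mid-thought. Rough sketch; the tokenizer name and overlap value are just illustrative:

  # Split a long document into max_seq_length-token chunks; each chunk
  # becomes one training sample. Tokenizer and overlap are placeholders.
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
  max_seq_length = 512

  def chunk_document(text, overlap=64):
      ids = tokenizer(text, add_special_tokens=False)["input_ids"]
      step = max_seq_length - overlap
      # Slide a fixed-size window over the token ids.
      return [ids[i:i + max_seq_length] for i in range(0, len(ids), step)]

  paper_text = open("paper.txt").read()  # a ~10-page paper
  samples = chunk_document(paper_text)   # list of 512-token chunks

A ~10-page paper is very roughly 5k-8k tokens, so you'd get on the order of 10-20 such chunks per paper.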


Thanks. What does "make sense" mean here? That each chunk is logically coherent (e.g. a paragraph of text in a document)?

And does the training then create windows of n-grams over those chunks? Or what is the input/output?

The reason I ask: if I had question/answer pairs, the question would be the input and the answer the output.

What is the "output" when the input is just a (logically coherent) chunk of text?


> What is the "output" when the input is just a (logically coherent) chunk of text?

It probably won't change much if it's just a single sample. If you put in a large corpus of samples that repeat on the same theme, then the model will be "tuned" to repeat that theme. If you increase the number of epochs, you can overtrain it, meaning that it will just spit out the training data text.
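Concretely: with a causal language model there is no separate "answer" column for plain text. The training objective is next-token prediction, so the "output" is effectively the same chunk shifted by one token. A minimal sketch with Transformers (model name is a placeholder):

  # For a causal LM, the labels are just the input ids; the model shifts
  # them internally, so the loss is next-token prediction over the chunk.
  from transformers import AutoTokenizer, AutoModelForCausalLM

  name = "huggyllama/llama-7b"  # placeholder
  tokenizer = AutoTokenizer.from_pretrained(name)
  model = AutoModelForCausalLM.from_pretrained(name)

  chunk = "Transformers use self-attention to mix information across tokens."
  enc = tokenizer(chunk, return_tensors="pt")
  out = model(input_ids=enc["input_ids"], labels=enc["input_ids"])
  print(out.loss)  # cross-entropy of predicting each token from its prefix

Question/answer pairs are usually handled the same way, just concatenated into one sequence (often with the loss masked out on the question part).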



