Show HN: Finetune LLaMA-7B on commodity GPUs using your own text (github.com/lxe)
449 points by lxe on March 22, 2023 | 98 comments
I've been playing around with https://github.com/zphang/minimal-llama/ and https://github.com/tloen/alpaca-lora/blob/main/finetune.py, and wanted to create a simple UI where you can just paste text, tweak the parameters, and finetune the model quickly using a modern GPU.

To prepare the data, simply separate your text with two blank lines.
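
For example (these are just made-up sample snippets), the raw training text file would look like this:

    This is the first training sample. It can span
    multiple lines of plain text.


    This is the second, unrelated sample, separated
    from the first sample by two blank lines.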

There's an inference tab, so you can test how the tuned model behaves.

This is my first foray into the world of LLM finetuning, Python, Torch, Transformers, LoRA, PEFT, and Gradio.

Enjoy!



Is there any library that allows you to train with a Mac M1/M2? I know it will be slower, but I'd rather spend money on a Mac Studio than on multiple graphics cards to get around the VRAM limitation.


For training you could rent a GPU server for a short period.


Data will be transmitted to a third party, which may be a problem in some use cases.


Or use something like Google Vertex to run a Docker job on GPUs.


Personally I would recommend Colab or another notebook environment like a Lambda machine.

Much cheaper and simpler than a bare metal machine. Data ingress/egress is hard, though for Colab you can just mount a Google Drive.

Unfortunately the API for training on M* chips (via MPS) is apparently still extremely buggy, so we have a ways to go before that is fully mainstream. And yes, I know that PyTorch just mainlined their MPS support last week too... but from what I've heard the low-level interface itself still needs some work. D:
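
If you want to poke at it anyway, you can at least check whether the MPS backend is available and fall back to CUDA/CPU; a minimal sketch, nothing model-specific:

    import torch

    # Pick the best available device: MPS on Apple Silicon, else CUDA, else CPU.
    if torch.backends.mps.is_available():
        device = torch.device("mps")
    elif torch.cuda.is_available():
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")
    print(f"Training on: {device}")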


Just FYI, Colab is way, way more expensive now after the “credits” update than it was a year ago. Lambda Labs is about as cheap at this point.


If you're fine with using other individuals' machines (meaning, you don't care about data privacy for the training set), using vast.ai is probably the cheapest way to do it today. But quality of machines and network speed varies greatly, as the machines are hosted by individuals around the world.


I've used it for a long while since the credits update and it's been cheaper for me than Lambda Labs for a few reasons, one being the spinup/spindown times and being able to switch to a very cheap instance for prototyping. Lambda Labs has a crazy spinup time (comparatively), something like 3-5 minutes or so D: I find them good for the more mega runs where I don't have to touch my instance for several hours and it's worth the notebook porting time/cost. :D

One of those use cases where raw cost doesn't always translate into money saved, at least in my personal experience. :D

It's also nice that you can more easily edit in-browser in Colab before launching an instance, so you can do development->debugging->training via a single webpage without it breaking the bank or having huge privacy concerns. I think a close second would be Jupyter on VSCode w/ a Lambda backend, but latency + fragmented architecture can add an extra step of complexity (which does matter! D:)


So far I've been using Lambda Labs, vast.ai, and RunPod to rent machines. A 3090 is about 30 cents an hour.


Vertex has a training service, rather than spinning up a notebook. If you can dockerize the training job, you just need to upload the container and data to GCS; then it's point and click to run once. I assume it's some sort of Kubeflow or whatever in the background.


That sounds nice for the more involved big runs. I'll try to take a look at it sometime. I have some training runs that loop in just under 7 seconds at the shortest for the actual training portion of the run (maybe 9-10 including notebook spinup). After doing this for so many years, I've found that keeping it as short as humanly possible is usually the way to go. :D

That said, I can definitely take a look at it for scale once I need to go over something like 1-2 hours maybe! :D


Yep, definitely best to do somewhere else if you can and just need to iterate, but it’s a good option when you have to do a big one and eat some real GPU time.


Renting is always less economical than owning. You can always sell your Mac Studio and still pay less.


Perhaps on a per unit basis. But there's a reason I buy steaks and not cows.


Is there a consensus yet on when you should fine-tune vs when you should use prompt engineering?

It seems not that hard to include in your prompt “here are some examples, please write like this and follow the style set here”. While that may make every completion request more expensive (more tokens), it also seems like fine-tuning these models can be quite expensive.

I'm curious if there are other trade-offs besides cost. Maybe the quality achieved is better with fine-tuning? Very interested to see how it all plays out. On the one hand, a massive model like GPT-4 can probably be prompted to match any style quite well, albeit at a cost, whereas fine-tuning a cheap model may get you exactly what you want.


Empirically, while the ability of LLMs to zero-shot learn is impressive, it's significantly worse than fine-tuning. An obvious example is LLaMA itself, from which it's quite hard to get useful instruction-following behavior; it requires a significant amount of prompt engineering and is still brittle at that.

Fine-tuning it on just 52k examples (Alpaca) makes a night-and-day difference in usability for instruction following.


If there are context limitations (your document can't fit in the context, or your session quickly loses the context), fine-tuning is a great option.


Even GPT-4 can only handle a few pages of text as prompt for examples. In most cases you'd want to fine tune.


I guess I’m just curious what the killer use-cases are for fine tuning. For example it seems like overkill to fine tune a Shakespeare model, because you can just say “write like Shakespeare” and it already knows what you want.


I guess you'd want to fine tune for content that wasn't already parsed before


To my understanding there are 4 levels at which to add information:

1. train a model

2. fine tune a model

3. create embeddings for a model

4. use few shot prompt examples at inference time

These have decreasing resource need, but also decreasing quality.

For example, the GPT-3 API (not yet the GPT-4 API) has functionality to work with your own embeddings, for example of your own source code documentation. Then you can query GPT-3 and it "knows" your source code docs and answers specifically with that in mind.
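
To sketch the usual mechanics (a rough example only; the doc chunks, question, and variable names are made up, and you compute and store the embeddings yourself rather than uploading them): embed your docs, find the chunk most similar to the query, and paste it into the prompt.

    import numpy as np
    import openai

    docs = ["Chunk 1 of your source code docs...", "Chunk 2..."]  # made-up chunks

    def embed(texts):
        resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
        return np.array([d["embedding"] for d in resp["data"]])

    doc_vecs = embed(docs)
    query = "How do I configure the widget?"  # made-up question
    q_vec = embed([query])[0]

    # Cosine similarity picks the most relevant chunk, which gets stuffed into the prompt.
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    context = docs[int(sims.argmax())]
    prompt = f"Documentation:\n{context}\n\nQuestion: {query}\nAnswer:"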


Where in the API docs is this described?


Now, use this library to "bootstrap the smarts of LLaMA from its own smartness" like this:

1. Ask it things. Let it answer.

2. Ask it to find errors in the answer it produced and to correct the answer.

3. Use the original prompt and the corrected output as training data.

This should, with each iteration, make the model less and less likely to output statements that are self-contradictory or obviously wrong, until the model can no longer spot its own faults.
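
Roughly, as a Python sketch (generate() is a placeholder for whatever inference call you're using against the current model):

    def build_self_correction_dataset(prompts, generate, rounds=1):
        # generate() is a hypothetical helper returning the model's completion for a prompt.
        pairs = []
        for prompt in prompts:
            answer = generate(prompt)                      # 1. ask it things, let it answer
            for _ in range(rounds):                        # 2. ask it to find and fix its errors
                critique = (
                    f"Question: {prompt}\nAnswer: {answer}\n"
                    "Find any errors or contradictions in the answer above "
                    "and rewrite it with them corrected."
                )
                answer = generate(critique)
            pairs.append({"prompt": prompt, "response": answer})  # 3. use as training data
        return pairs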


I recall reading that when training AlphaZero they would start pitting it against itself, doing millions of games in a few days, which worked great because there is an external metric (who wins the chess game) that is objectively a good measure to train towards.

But if you let an AI's approval be the metric, things turn a lot more fuzzy and subjective. The goal is not actually "to write a good answer without error" but actually "to write an answer that is approved by the AI". Those are very different goals, and as you keep using it you'll get a bigger and bigger divergence, until eventually the AI is just answering complete garbage nonsense that precisely hits certain sweet spots in the grading AI.

This divergence of the target vs the actual human goal is a pretty interesting problem in AI safety research. I love the example where an AI trained to stay alive as long as possible in Tetris realized that pausing the game was the best strategy.


You’re describing a GAN basically.

But yeah, you’re going to need an objective metric or human input otherwise the system is going to diverge in strange ways.


I honestly think I might do this experiment, just to see what comes out. I know it will be utter garbage, but it will probably be interesting utter garbage.


Please do :)

The correction prompt is very important; it will definitely determine the outcome of the process, and a bad correction prompt will obviously lead to a garbage result.

Training in steps with different prompts might be of value. The first step might be to fix contradictions, then factual errors if that is an issue. This is an idea that I got when viewing the output of LLaMA; it often contains contradictions (e.g. an example I have seen is "Peter is a boy and he is part of the Gama sorority"). Asking it to fix those types of issues would be a good first step.

But I suspect that this type of training would need to be mixed with original training data. Otherwise the restructuring in the model caused by the new training would most likely garble the rest of the model.


That wasn't an AI, that was a "make the numbers go up" (lexicographic ordering) system with TAS rewinding for short-term bruteforcing.


Interesting, but the core point remains true. The algorithm optimises for something which may not entirely coincide with the creator's intentions.


For those skeptical of the above comment, this technique absolutely works and powers production-grade models like Anthropic’s Claude. There’s plenty of literature on this, but here are a couple papers that might be helpful for people doing their own training:

- Constitutional AI: by Anthropic, an “RLAIF” technique that creates the preference model for “finding errors” based on a set of around 70 “principles” the AI uses to check its own output, not human feedback like in ChatGPT. This technique taught the Claude bot to avoid harmful output with few to no manual harmfulness labels! https://arxiv.org/abs/2212.08073. Not sure if there’s a HuggingFace implementation with LoRA / PEFT yet like there is for regular RLHF, so somebody may need to implement this for Llama still

- Self-Instruct: Creates artificial training data on instruction tuning from an untuned base model, from a tiny seed of prompts, and filters out the bad ones before fine-tuning. Manages to approach Instruct-GPT performance with only ~100 human labels. https://arxiv.org/abs/2212.10560


Or it will twist itself into a giant hairball of contorted logic, like GPT3.5 does when I (a human) encourage it to explain its errors.


You should try using a larger model like llama-30b or even GPT-3 for the feedback. That way you might be able to condense the knowledge from these really big models into a smaller model.


This is a cool idea in theory and I think could be useful in certain kinds of circumstances, but this particular instantiation would likely go into a bad bias spiral.

This is somewhat similar to how GANs try to learn the density of the underlying data, but here you do not have the underlying data as a reference, if that makes sense. It's sort of like filling a mattress with helium instead of air. Sure, the mattress will be lighter, but that does not mean you will float on it, if that makes any sense at all.

Hope that helps as a cogent answer to this question.


This is awesome! I noticed it said a prereq is >16GB VRAM — is that >= 16GB, or is it really explicitly greater than 16? Would be sweet to be able to finetune locally on, say, a 3080.


I gave this a try and it seems to max out using about 12GB of VRAM on a RTX 3090 Ti.


Tried the 30b-hf set too, but it was too much (24GB available). 13b-hf works fine, maxing out at 17GB.


Looks like you need about 120 GB to fine-tune the 65B model with this code at a sequence length of 512. How does the memory usage scale as the sequence length grows?


The attention implementation is such that memory scales quadratically with sequence length. Overall, this is still a small factor compared to just the model weights, but at some sequence lengths this would dominate.

By using flash attention, you can get the memory requirement down to scaling linearly with sequence length.
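
Back-of-the-envelope for the quadratic term, assuming 7B-ish dimensions (32 layers, 32 heads, fp16) and counting only the attention score matrices:

    def attn_matrix_gb(seq_len, n_layers=32, n_heads=32, bytes_per_el=2):
        # One (seq_len x seq_len) score matrix per head per layer, in fp16.
        return n_layers * n_heads * seq_len * seq_len * bytes_per_el / 1e9

    for s in (512, 2048, 8192):
        print(s, round(attn_matrix_gb(s), 1), "GB")
    # 512 -> ~0.5 GB, 2048 -> ~8.6 GB, 8192 -> ~137 GB: negligible at 512, dominant later.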


Lots of VRAM, but the method is the same. Here's someone who finetuned llama-30b on alpaca dataset for example: https://github.com/deep-diver/Alpaca-LoRA-Serve


> How does the memory usage scale as the sequence length grows?

That's a good question. I was under the assumption it's linearly proportional, but I can test it out I guess.


I suspect it's linear with a small constant factor.


Man, I hope there are online calculators that let people visualize the costs of training these things.


Really takes that much VRAM??


Normally people split up the model across multiple GPUs, i.e. model/tensor parallelism.
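
With the HuggingFace stack, a minimal sketch of that (assuming you have converted LLaMA weights; the repo name here is just one commonly used mirror and is otherwise a placeholder):

    from transformers import LlamaForCausalLM, LlamaTokenizer

    # device_map="auto" shards the layers across all visible GPUs (needs accelerate);
    # load_in_8bit=True cuts VRAM further (needs bitsandbytes).
    model = LlamaForCausalLM.from_pretrained(
        "decapoda-research/llama-7b-hf",   # placeholder: point at your converted weights
        device_map="auto",
        load_in_8bit=True,
    )
    tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")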


Slightly tangential, is there some kind of crowdsourced effort to build training data for fine tuning? Alpaca used the training data built from gpt-3.5, so there are terms of use restrictions



This looks good. Is the training data in the repo itself?


IANAL, but wouldn't feeding the training data into Alpaca and then having it output similar text constitute new data that is not copyrighted by Stanford/OpenAI/Facebook? (It would be a significantly creative and novel work to get the prompts working correctly..) Obviously you are also not bound by the OpenAI terms of use, and I'm not sure if Stanford's terms of use are as broad and well-defined...


The terms of ChatGPT/GPT-3.5 explicitly state that one cannot use their data to construct competitive models


Not enforceable. All they can do is ban your ChatGPT account.


I'm not a lawyer, but if a successful company was built off ChatGPT's output without having a contract in place, I could see the totally moral megacorps trying to legally take an ownership stake.

Even recently the US Copyright Office has asked you to list any parts built with AI, as we don't have the laws in place to cover this:

https://www.copyright.gov/ai/


> Not enforceable

Are you willing to assign an upper limit on this probability and bet on it?


OpenAI disclaims ownership interest in the model output. If a subscriber (who has a contractual relationship with OpenAI) chooses to generate outputs that could be used to train a competing model, and chooses to share those outputs with third parties, that is not prohibited by the agreement. Further, the data being shared belongs to the subscriber and can be licensed however they desire (though actually, model outputs may not be copyrightable at all). If a third party who does not have a contractual relationship with OpenAI chooses to take this data and train a competing model, they are using the data under valid license and have breached no obligation to OpenAI.


Do you have the money to enforce such (dis)claims? That's really what it comes down to, you might be right but you'd go bankrupt in the process. It's simply better to just use something that didn't touch their outputs in the first place, like OpenAssistant.


Yes, but 1) you would not be using their data, and 2) you are not bound by their ToS if you never signed up to their service, right?


You can get datasets here, depending on the training you want to do: https://huggingface.co/datasets/


OpenChatKit is great tbh


I'm really an AI noob, but let's say I tried the 7B model for translation purposes with below-acceptable results. Can I train the model with 1 million translated sentences to improve the quality of the translation output?


Yes, you can finetune the model with reference outputs for any kind of language task. For translation, though, you are better off starting from a model specifically trained for that purpose, such as facebook/nllb-200-distilled-1.3B. It will be faster and more accurate.
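
For reference, a minimal sketch of using that model via transformers (NLLB uses FLORES-200-style language codes; English-to-French shown just as an example):

    from transformers import pipeline

    translator = pipeline(
        "translation",
        model="facebook/nllb-200-distilled-1.3B",
        src_lang="eng_Latn",
        tgt_lang="fra_Latn",
    )
    print(translator("The quick brown fox jumps over the lazy dog.")[0]["translation_text"])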


> For translation, you are better off starting from a model specifically trained for that purpose

Yes, but wasn't the whole point of the recent LLM research to show that you didn't need to fine-tune for a specific task?


Sure, but you will get better results at a smaller number of parameters from a specifically trained model right now if you are trying to train/host it yourself.

Remember that GPT-3 is 175 billion parameters, so many times bigger than both of the above models (and GPT-4 is rumoured to be bigger still), which also allows it to be more generalisable.

If GPT-3 were trained at 7 billion parameters, it might also lose its language translation capabilities.


You could just use a much bigger model to perform arbitrary tasks without fine-tuning.


Do you know any guide on how to train these models? Any thoughts on the hardware requirements?


Can you give an example on how you prompted the model? Your issue is probably related to that, but I would need an example to be sure. I've found the 7b Alpaca model [1] to work surprisingly well! Here's how you're supposed to prompt it:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction: {instruction}

### Response:

or

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction: {instruction}

### Input: {input}

### Response:

[1] https://github.com/cocktailpeanut/dalai


Barebones llama is just a text completion model. You can give it a prompt like

    A conversation between a human and assistant

    Human: How old is the sun?
    Assistant:
And it will complete it.

Alpaca/Dalai are finetuned on a dataset that's formatted as this:

    ### Instruction: {instruction}

    ### Input: {input}

    ### Response:
So even without pre-prompting in this format it's going to be heavily biased towards performing completions in this format anyways.

It's always helpful to finetune on a preformatted prompt depending on what your task is.
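
e.g. a small formatter along these lines (field names just mirror the Alpaca JSON; purely illustrative):

    def format_alpaca(example):
        # example is a dict with "instruction", optional "input", and "output"
        if example.get("input"):
            prompt = (
                "Below is an instruction that describes a task, paired with an input "
                "that provides further context. Write a response that appropriately "
                "completes the request.\n\n"
                f"### Instruction: {example['instruction']}\n\n"
                f"### Input: {example['input']}\n\n"
                "### Response:"
            )
        else:
            prompt = (
                "Below is an instruction that describes a task. Write a response that "
                "appropriately completes the request.\n\n"
                f"### Instruction: {example['instruction']}\n\n"
                "### Response:"
            )
        return prompt + " " + example["output"]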


What do the training screenshot examples try to accomplish? In what way would the model be different after the example fine-tuning?


Just added a screenshot of the inference UI. I just finished using it to tune on a subset of https://huggingface.co/datasets/Anthropic/hh-rlhf, and it seems to be working.


Same question here!


Great job, OP. Yesterday, after managing to run a local instance of alpaca.cpp and reading more about what Alpaca is and how it got fine-tuned from LLaMA, I began to wonder what it would take to fine-tune with my own set of data.

With no real knowledge of LLMs, and having only recently started to understand what LLM terms mean, such as 'model, inference, LLM model, instruction set, fine-tuning', what else do you think is required to make a tool like yours?

This is for education purposes, and I'd love to take a stab at creating something like this and write an inference implementation, like the dev behind the LLaMA inference in Rust.

> I am not familiar with HuggingFace libraries at all; why were they important in your implementation?

> Gradio, I believe, is the UI that allows plugging in different LLM models; I am familiar with text-generation-webui on GitHub, which uses Gradio.

> LoRA, I think, further fine-tunes a model, just like how LLaMA got fine-tuned on an instruction set to produce the Alpaca model.


> With no real knowledge of LLMs, and having only recently started to understand what LLM terms mean, such as 'model, inference, LLM model, instruction set, fine-tuning', what else do you think is required to make a tool like yours?

This was me a few weeks ago. I got interested in all this when FlexGen (https://github.com/FMInference/FlexGen) was announced, which allowed running inference using the OPT model on consumer hardware. I'm an avid user of Stable Diffusion, and I wanted to see if I could have an SD equivalent of ChatGPT.

Not understanding the details of hyperparameters or terminology, I basically asked ChatGPT to explain to me what these things are:

   Explain to someone who is a software engineer with limited knowledge of ML terms or linear algebra, what is "feed forward" and "self-attention" in the context of ML and large language models. Provide examples when possible.
I did the same with all the other terms I didn't understand, like "ADAM optimizer", "gradient", etc. I relied on it very heavily and cross-referenced the answers.

Looking at other people's code and just tinkering with things on my own really helped.

Through the FlexGen discord I've discovered https://github.com/oobabooga/text-generation-webui where I spent days just playing around with models. This got me into the huggingface ecosystem -- their transformers library is an easy way to get started. I joined a few other discords, like LLaMA Unofficial, RWKV, Eleuther AI, Together, Hivemind and Petals.

I bookmarked a bunch of resources but it's very sporadic. Here are some:

- https://github.com/zphang/minimal-llama/#peft-fine-tuning-wi...

- https://github.com/togethercomputer/OpenChatKit

- https://www.cstroik.com/index.php/2023/02/18/finetuning-an-a...

- https://github.com/huggingface/peft

- https://github.com/kingoflolz/mesh-transformer-jax/blob/mast...

- https://github.com/oobabooga/text-generation-webui

- https://github.com/hizkifw/WebChatRWKVstic

- https://github.com/ggerganov/whisper.cpp

- https://github.com/qwopqwop200/GPTQ-for-LLaMa

- https://github.com/oobabooga/text-generation-webui/issues/14...

- https://github.com/bigscience-workshop/petals

- https://github.com/alpa-projects/alpa


Very interesting project, thanks. I see that DeepSpeed is on your TODO list. I wonder what the biggest LLaMA model is that can be fine-tuned on 2x RTX 3090 & 128GB RAM.


You can probably finetune a 13b one with that. Try these scripts: https://github.com/zphang/minimal-llama/#minimal-llama


What if I want to finetune with long documents? Say AI papers that are ~10 pages long on average? How would they be tokenized given that max_seq_length is 512?


Split your training data into chunks of text that make sense. A random dataset example: https://huggingface.co/datasets/imdb
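
A rough sketch of that, packing paragraphs into <=512-token chunks with the LLaMA tokenizer (the repo name is just one commonly used mirror of the converted weights):

    from transformers import LlamaTokenizer

    tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")

    def chunk_document(text, max_len=512):
        # Split on paragraphs, then pack them into chunks of at most max_len tokens.
        chunks, current = [], []
        for para in text.split("\n\n"):
            candidate = "\n\n".join(current + [para])
            if len(tokenizer(candidate)["input_ids"]) > max_len and current:
                chunks.append("\n\n".join(current))
                current = [para]
            else:
                current.append(para)
        if current:
            chunks.append("\n\n".join(current))
        return chunks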


Thanks, what does "making sense" mean? Being logically coherent (e.g. a paragraph of text in a document)?

And does the training then create windows of n-grams on those chunks? Or what is the input/output?

The reason I ask: If I had question/answer pairs, the question is the input, the answer is the output.

What is the "output" when the input is just a (logically coherent) chunk of text?


> What is the "output" when the input is just a (logically coherent) chunk of text?

It probably won't change much if it's just a single sample. If you put in a large corpus of samples that repeat on the same theme, then the model will be "tuned" to repeat that theme. If you increase the number of epochs, you can overtrain it, meaning that it will just spit out the training data text.
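
Mechanically there is no separate "output" column for plain text: for causal LM finetuning the labels are the input tokens themselves, and the model is trained to predict each next token. A sketch with the transformers convention (assumes a causal LM `model` and `tokenizer` are already loaded, and `chunk` is one of your text chunks):

    # Input and labels are the same token sequence; the library shifts them internally,
    # so the model is scored on predicting token t+1 from tokens <= t.
    batch = tokenizer(chunk, truncation=True, max_length=512, return_tensors="pt")
    outputs = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
    loss = outputs.loss  # next-token prediction loss over the chunk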


Hmm... I would think of Prolog or other rule-based systems as doing inference, but not of a neural network (which is essentially a mesh of matrix multiplications and functions).

This statement that NNs do inference is not entirely correct, IMHO.


I think you are unfairly downvoted for this, probably because I have just as strong opinion in the other direction: I see GPT (especially -4) as a kind of "killer Prolog-style inference engine".

How does Prolog work? You:

* pass it some predicates

* specify rules about what the relationships mean and how various things are computed from the data in predicates

* query a variable

* answer pops out.

How can GPT do this task? You:

* pass it some predicates (structured machine-readable syntax or natural-language sentences)

* specify rules (in natural-language sentences, though it helps to iterate on the wording a bit to make the rules more rigid, and more likely to provide the correct output ~every time) - you don't normally need to specify relationships explicitly because GPT can usually figure it out

* include some additional "massaging" wording to get it reproducibly outputting the kind of result you want

* query a variable (tell it what to find out/infer from the data)

* answer pops out (in human-readable language or structured syntax).

In some ways, they are very much alike. And GPT is much more natural to program with than Prolog.


They get there in completely different ways though. If there's no answer Prolog will fail out, and GPT* will usually make (stuff) up.


Exactly. Prolog fails on non-true inferences, and this is why it's inference. Otherwise we'd be calling all the made-up high-school nonsense inference.

It is absolutely not fair, established or not, to call non-linear regression inference. Maybe prognosis, maybe prediction, maybe just approximation. But inference is something that actually does logic, based on facts... and probability theory infers only probabilities, not facts.


GPT is not being programmed, and so it is not Prolog. Prolog is set up to follow rules and infer logical consequences. Prolog is grammar; GPT-3 is an algebraic structure, which perhaps is also... differentiable, so non-discrete.


Inference is a very established term in the field for a variety of things, not just the kind of inference you’re referring to.


Really interesting, thanks for sharing, I'm excited to try this out.

Would it also be possible to just train the model from scratch on commodity hardware and how big of a difference in training time would that be?


Can this be used to programmatically train the model? I've got a big website that I would like LLaMA to chew on and be aware of its content so it can answer questions.


Yes. Check `main.py`, rewrite it to load the text from whatever you want.
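
For a website, a rough sketch of what that swap might look like (the URL and selectors are placeholders; assumes requests + beautifulsoup4 are installed):

    import requests
    from bs4 import BeautifulSoup

    def page_to_training_text(url):
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        # Grab paragraph text and join samples with two blank lines,
        # matching the format the finetuner expects.
        paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
        return "\n\n\n".join(p for p in paragraphs if p)

    text = page_to_training_text("https://example.com/some-page")  # placeholder URL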


How does Llama compare to the actually open source Flan UL2?


I've built up a large personal library of hand made Anki flashcard decks over the years. This looks like just what I need to train a model on those decks.


Is it possible to fine-tune using CPU only?


I haven't tried it myself, and I haven't actually heard of anyone attempting it with a model this large.


You could, but it will be very slow. Oh, and make sure the code is CPU-compatible.


my instance doesn't seem impressed:

So, I stumbled upon this Simple LLaMA FineTuner project by Aleksey Smolenchuk, claiming to be a beginner-friendly tool for fine-tuning the LLaMA-7B language model using the LoRA method via the PEFT library. It supposedly runs on a regular Colab Tesla T4 instance for smaller datasets and sample lengths.

The so-called "intuitive" UI lets users manage datasets, adjust parameters, and train/evaluate models. However, I can't help but question the actual value of such a tool. Is it just an attempt to dumb down the process for newcomers? Are there any plans to cater to more experienced users?

The guide provided is straightforward, but it feels like a solution in search of a problem. I'm skeptical about the impact this tool will have on NLP fine-tuning.


> I can't help but question the actual value of such a tool. Is it just an attempt to dumb down the process for newcomers?

Actually, you've hit the nail on the head here. I wanted something where I, a complete beginner, can quickly play around with data, parameters, finetune, iterate, without investing too much time.

That's also why I've annotated all the training parameters in the code and UI -- so beginners like me can understand what each slider does to their tuning and to their generation.


This is exactly the sweet spot I'm looking for. Technical enough that I can play around, simplified enough that I'm investing an hour or two of my time instead of a whole weekend.


I get that you /can/ use an LLM to generate troll feedback for random projects... but why?


I was just excited that I got it working at all :/


So you are annoyed that something targeted for beginners does not also cater to experts?


me? re-read that s'il vous plaît


Maybe put the bit it said in quotes? I didn't read closely enough myself the first time; it took your subsequent comments to make me realize what you'd done.



