Fine-Tuning Mistral7B on Python Code (wandb.ai)
61 points by tarruda on Oct 8, 2023 | 11 comments



Yesterday I summarised 15 years of my web comments with Mistral for RAG augmentation; it took a few hours. Great little model I can run even on a 2080 Ti. As small as it is, it understands English and follows instructions fluently.


For chat, I suggest trying the OpenOrca finetune; the performance is amazing and it responds very well to a system prompt: https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca

It runs very well on my laptop's RTX 3070, which only has 8 GB of VRAM; it felt like running early versions of ChatGPT locally.
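For anyone who wants to try it, here is a rough sketch with transformers (the ChatML-style prompt is the template described on the model card; 4-bit loading via bitsandbytes is my assumption for squeezing it into ~8 GB, not a setting from this thread):

    # Rough sketch: Open-Orca/Mistral-7B-OpenOrca with a system prompt.
    # Assumes the ChatML template from the model card and bitsandbytes for
    # 4-bit loading (a 7B model in 4-bit is roughly 4 GB of VRAM).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Open-Orca/Mistral-7B-OpenOrca"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, load_in_4bit=True, device_map="auto"
    )

    prompt = (
        "<|im_start|>system\nYou are a terse coding assistant.<|im_end|>\n"
        "<|im_start|>user\nWrite a Python one-liner that reverses a string.<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))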


It would be nice if they shared some before-and-after results, because I did some spot checks and got the same output (using default Mistral) as found in the python_code_instructions_18k_alpaca dataset.

Setting that aside, is there a guide on how to create a good training dataset for fine-tuning models like Mistral?


I'd say the original Alpaca paper is a good source of inspiration on how to create datasets; they even shared the script used to generate data with the OpenAI API.

One thing that came to mind for creating datasets for other programming languages would be to start with this Python dataset and use GPT-4 to convert it to equivalents in other languages. You could even automatically test each generated example and ask GPT-4 to correct any errors.
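Very roughly, the loop could look like this (a sketch only: the dataset field names, the prompts, and the node-based smoke test are my assumptions, not something from the article):

    # Sketch: translate a Python instruction/output pair with GPT-4, then
    # smoke-test the result by running it. Field names, prompts, and the
    # JavaScript/node test step are illustrative assumptions.
    import subprocess, tempfile
    from openai import OpenAI

    client = OpenAI()

    def translate_example(example: dict, target_lang: str = "JavaScript") -> str:
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system",
                 "content": f"Translate Python solutions into idiomatic {target_lang}. Return only code."},
                {"role": "user",
                 "content": f"Instruction: {example['instruction']}\n\nPython solution:\n{example['output']}"},
            ],
        )
        return resp.choices[0].message.content

    def passes_smoke_test(js_code: str) -> bool:
        # Non-zero exit code -> feed stderr back to GPT-4 and ask for a fix.
        with tempfile.NamedTemporaryFile("w", suffix=".js", delete=False) as f:
            f.write(js_code)
        result = subprocess.run(["node", f.name], capture_output=True, timeout=30)
        return result.returncode == 0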


ctrl+F for Baseline and you'll see a part where they show the two models' outputs with the same prompt.

I didn't look closely to see whether this is deterministic with the right temperature and top_p or whatever.


Uh, missed that. Thank you! What are you referring to by temperature?


Due to the default settings of `top_k`, `top_p`, and `temperature`, which introduce a probabilistic element to token selection, two runs of the same prompt with the same model may yield different outputs.

The first two parameters, `top_k` and `top_p`, define the subset of tokens eligible for selection, while `temperature` influences the probability distribution from which tokens are sampled.
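In Hugging Face generate() terms it looks like this (a sketch, not settings from the article; assumes a model and tokenizer are already loaded):

    # Greedy decoding always picks the highest-probability token, so it is
    # deterministic. Sampling rescales the distribution with temperature,
    # restricts the candidate set with top_k/top_p, then draws at random.
    inputs = tokenizer("Write a Python function that reverses a string.",
                       return_tensors="pt")

    greedy = model.generate(**inputs, do_sample=False, max_new_tokens=64)

    sampled = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95,
        max_new_tokens=64,
    )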


Have you (or others!) done much research on using different subsets of target modules? Just doing q, k, v, o uses dramatically less RAM and is much more accessible (Mistral 7B on a 3080 Ti).


I haven't, but I didn't write the article. I don't understand what these modules mean for fine-tuning.

Here someone does Alpaca LoRA fine-tuning on Llama 7b using only q/v modules: https://colab.research.google.com/drive/1rqWABmz2ZfolJOdoy6T...
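With peft, the choice is just the target_modules list in the LoRA config; something like this (the module names are the Llama/Mistral attention projections, and the other hyperparameters are illustrative, not the notebook's exact settings):

    # Sketch: restricting LoRA to the q/v attention projections with peft.
    # Fewer target modules means fewer trainable parameters and less VRAM.
    from peft import LoraConfig, get_peft_model

    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # add "k_proj", "o_proj" to adapt more
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(model, lora_config)  # everything not targeted stays frozen
    model.print_trainable_parameters()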


They are different parts of the network. If you don't fine-tune them, they don't change; they are "frozen", if you're familiar with fine-tuning neural networks.


How does one decide which of these modules should be fine-tuned for different tasks?



