Hacker News
Llama 1.3B Trained on 200B Tokens for Commercial Use (huggingface.co)
25 points by vsroy on April 28, 2023 | hide | past | favorite | 7 comments


>MPT-1b-RedPajama-200b-dolly is a 1.3 billion parameter decoder-only transformer pre-trained on the RedPajama dataset and subsequently fine-tuned on the Databricks Dolly instruction dataset.

> The model was pre-trained for 200B tokens (batch size 2200, sequence length 2048). It was trained on the following data mix:

    67% RedPajama Common Crawl
    15% C4
    4.5% RedPajama GitHub
    4.5% RedPajama Wikipedia
    4.5% RedPajama Books
    2.5% RedPajama Arxiv
    2% RedPajama StackExchange
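
For a rough sense of scale, here is a quick back-of-the-envelope sketch (Python, purely illustrative) that restates the quoted mix and works out tokens per optimizer step and roughly how many steps a 200B-token run implies:

    # Sanity-check the quoted numbers: batch size 2200, sequence length 2048,
    # 200B training tokens. The mix dict just restates the percentages above.
    mix = {
        "RedPajama Common Crawl": 0.67, "C4": 0.15, "RedPajama GitHub": 0.045,
        "RedPajama Wikipedia": 0.045, "RedPajama Books": 0.045,
        "RedPajama Arxiv": 0.025, "RedPajama StackExchange": 0.02,
    }
    assert abs(sum(mix.values()) - 1.0) < 1e-9

    tokens_per_step = 2200 * 2048            # ~4.5M tokens per optimizer step
    steps = 200e9 / tokens_per_step          # ~44k steps to reach 200B tokens
    print(f"{tokens_per_step/1e6:.1f}M tokens/step, ~{steps:,.0f} steps")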


Any benchmarks on model performance?


As an AI language model I cannot comment on sensitive topics like LLM. Please try posting with an alternative title.


You're getting downvoted hard, but I found this pretty funny.


What goes into choosing the number of parameters? Why 1.3B? Why not more? Why not less?


The Chinchilla scaling laws[0]. The TL;DR is that they tried a bunch of different combinations of training-set token counts and parameter counts, with the goal of mapping out the trade-off between training time/cost and model accuracy. If you only want to spend a given amount of money, or you only have a certain amount of data, the Chinchilla laws tell you roughly how many parameters to choose to hit the sweet spot.

[0] https://lifearchitect.ai/chinchilla/


Chinchilla scaling laws aren't necessarily used for the tradeoffs of training runs like this one.

Here is the goal of Chinchilla scaling from https://arxiv.org/pdf/2203.15556.pdf : "In this work, we revisit the question: Given a fixed FLOPs budget, how should one trade-off model size and the number of training tokens? To answer this question, we model the final pre-training loss L(N, D) as a function of the number of model parameters N, and the number of training tokens, D. Since the computational budget C is a deterministic function FLOPs(N, D) of the number of seen training tokens and model parameters, we are interested in minimizing L under the constraint FLOPs(N, D) = C."
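
To make that concrete, here is a minimal sketch (Python) of that optimization: fix a FLOPs budget C, use the common C ≈ 6·N·D approximation, and sweep over model sizes to find the (N, D) pair that minimizes the fitted loss. The constants are the approximate fits reported in the paper; treat the whole thing as illustrative rather than a faithful reproduction:

    # For a fixed FLOPs budget C, set D = C / (6 * N) (the common C ~ 6*N*D
    # approximation) and pick the N minimizing the fitted Chinchilla loss
    # L(N, D) = E + A/N^alpha + B/D^beta. Constants are approximate fits
    # from the paper; everything here is a rough illustration.
    import numpy as np

    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(N, D):
        return E + A / N**alpha + B / D**beta

    C = 6 * 1.3e9 * 200e9            # budget of a 1.3B model seeing 200B tokens

    N = np.logspace(8, 11, 2000)     # candidate sizes: 100M to 100B params
    D = C / (6 * N)                  # tokens affordable at each size
    i = np.argmin(loss(N, D))
    print(f"compute-optimal for this budget: ~{N[i]/1e9:.1f}B params "
          f"on ~{D[i]/1e9:.0f}B tokens")

With these constants the sweep lands on a somewhat larger model trained on fewer tokens than 1.3B/200B, which is the sense in which this run is "overtrained" relative to Chinchilla.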

But this isn't the only interesting optimization. It can also be useful to have small models that are "overtrained" relative to Chinchilla optimality. There are practical reasons to prefer smaller models over big models even at some cost in training expense or pre-training loss, which the Chinchilla optimization does not account for at all. For one thing, smaller models fit in the memory of a wider class of devices. For another, smaller models incur less computational expense at inference time. They will also probably have lower latency.
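
A toy example of that tradeoff (illustrative numbers, nothing from the thread): training costs roughly 6·N·D FLOPs once, but every token you later serve costs roughly 2·N FLOPs, so over a long enough deployment a small overtrained model can be cheaper end to end than a bigger compute-optimal one:

    # Toy lifetime-cost comparison: train once (~6*N*D FLOPs), then pay
    # ~2*N FLOPs for every served token. The served-token count and the
    # "Chinchilla-ish" 7B/140B pairing are illustrative assumptions.
    def lifetime_flops(n_params, train_tokens, served_tokens):
        return 6 * n_params * train_tokens + 2 * n_params * served_tokens

    served = 1e12                                  # assume 1T tokens served
    small = lifetime_flops(1.3e9, 200e9, served)   # overtrained 1.3B model
    big = lifetime_flops(7e9, 140e9, served)       # ~compute-optimal 7B model

    print(f"1.3B lifetime: {small:.2e} FLOPs, 7B lifetime: {big:.2e} FLOPs")

At these (made-up) serving volumes the 1.3B model comes out several times cheaper overall, and that's before memory footprint and latency even enter the picture.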


I'm pretty sure RedPajama said they were going to overtrain relative to Chinchilla, so unless this is someone else, or they ran out of time, this should be close to the original LLaMA. Actually, I think this is someone else?



