Distilling means fine-tuning an existing model using outputs from the bigger mod...

lr1970 · 2025-02-15T20:41:52 1739652112

> Distilling means fine-tuning an existing model using outputs from the bigger model.

Crucially, the output of the teacher model includes token probabilities so that the fine-tuning is trying to learn the entire output distribution.

numba888 · 2025-02-15T21:27:16 1739654836

That's possible only if they use the same tokens. Which likely requires they share the same tokenizer. Not sure that's the case here, R1 was built on OpenAI closed model's output.

anon373839 · 2025-02-15T22:11:15 1739657475

That was an (as far as I can tell) unsubstantiated claim made by OpenAI. It doesn’t even make sense, as o1’s reasoning traces are not provided to the user.

int_19h · 2025-02-15T23:01:42 1739660502

One reason to believe OpenAI here is that R1 will occasionally claim to be made by OpenAI, which in e.g. LLaMA finetunes is indicative of using synthetic data generated by ChatGPT.

Note that this isn't necessarily o1. While o1 is specifically trained to do CoT, you can also make 4o etc produce it with the appropriate prompts, and then train on that output.

HPsquared · 2025-02-16T09:57:29 1739699849

I suppose it might be hard to avoid encountering ChatGPT outputs "in the wild" now, even if they don't explicitly use it for training material.

sumedh · 2025-02-16T00:19:49 1739665189

> Google it!

Or you could provide some example links

dcre · 2025-02-16T14:23:22 1739715802

This only makes sense if I have some great canonical explanation of distillation on hand. But it’s a simple concept. There are hundreds of identical explanations online.

sumedh · 2025-02-17T11:05:45 1739790345

Are all the 100 explanations good or would you recommend one of them?