One problem we've had developing autonomous SWE agents (https://github.com/ai-christianson/RA.Aid) is that open models just haven't performed anywhere near Sonnet at controlling the agent. Our experience is echoed by many other agent devs, and you can see it for yourself if you try DeepSeek (V3 or R1) vs. Sonnet in any agentic product.
Do you think that your training setup could help train these models to be better at agentic work?
Cool repo! Agreed, OSS models are still lagging, but they're definitely catching up!
So with GRPO and reinforcement learning, OSS model creators now have one more tool to make their models much better: we no longer need vast amounts of labeled CoT data, just questions and answers, and we let RL / GRPO figure out the CoT itself, guided by a reward function.
So yes, it can definitely help with agentic workloads!
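To make that concrete, here's a minimal sketch of what such a run can look like with trl's GRPOTrainer (which Unsloth builds on). The dataset, model checkpoint, XML format, and reward functions below are illustrative assumptions, not the exact recipe:

```python
import re
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Hypothetical Q&A data: no CoT labels, just questions and final answers.
train_dataset = Dataset.from_list([
    {"prompt": "What is 13 * 7?", "answer": "91"},
    {"prompt": "What is 48 / 6?", "answer": "8"},
])

# Reward 1: did the model wrap its output in the expected XML tags?
def xml_format_reward(completions, **kwargs):
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return [1.0 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]

# Reward 2: does the text inside <answer> match the reference answer?
# Extra dataset columns (here "answer") are passed through as kwargs.
def correctness_reward(completions, answer, **kwargs):
    rewards = []
    for c, a in zip(completions, answer):
        m = re.search(r"<answer>(.*?)</answer>", c, re.DOTALL)
        rewards.append(2.0 if m and m.group(1).strip() == a else 0.0)
    return rewards

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # any causal LM checkpoint
    reward_funcs=[xml_format_reward, correctness_reward],
    args=GRPOConfig(
        output_dir="grpo-out",
        num_generations=4,        # completions sampled per prompt, compared within the group
        max_completion_length=256,
    ),
    train_dataset=train_dataset,
)
trainer.train()
```

The key point is that the reward functions only score the final output (format + correctness); the model discovers whatever CoT earns higher reward within each sampled group on its own.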
Great to see Unsloth here! How long did the training process take?
Also, a different version of the same original Colab couldn't get a 135M model to learn the XML tags, so do you think 8B is the minimum model size for this?