DeepSeek-R1

simonw · 2025-01-20T14:44:27 1737384267

OK, these are a LOT of fun to play with. I've been trying out a quantized version of the Llama 3 one from here: https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B-...

The one I'm running is the 8.54GB file. I'm using Ollama like this:

    ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0

You can prompt it directly there, but I'm using my LLM tool and the llm-ollama plugin to run and log prompts against it. Once Ollama has loaded the model (from the above command) you can try those with uvx like this:

    uvx --with llm-ollama \
      llm -m 'hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0' \
      'a joke about a pelican and a walrus who run a tea room together'

Here's what I got - the joke itself is rubbish but the "thinking" section is fascinating: https://gist.github.com/simonw/f505ce733a435c8fc8fdf3448e381...

I also set an alias for the model like this:

    llm aliases set r1l 'hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0'

Now I can run "llm -m r1l" (for R1 Llama) instead.

I wrote up my experiments so far on my blog: https://simonwillison.net/2025/Jan/20/deepseek-r1/

simonw · 2025-01-20T19:20:49 1737400849

I got a quantized Llama 70B model working, using most of my 64GB of RAM but it's usable:

    ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF:Q3_K_M

That's a 34GB download. I'm accessing it via https://github.com/open-webui/open-webui which I ran like this:

    uvx --python 3.11 open-webui serve

I have TailScale on my laptop and phone so I can run experiments directly from my phone while leaving my laptop plugged in at home.

jtbaker · 2025-01-20T20:22:32 1737404552

Thanks for all your writing on these topics Simon! It has turned me from a bit of a naysayer to an optimist around this tooling, especially being able to run stuff locally with access to tools. Have an M4 Pro mac mini arriving this week to have a similar self hosted setup over tailscale.

brewtide · 2025-01-24T12:11:40 1737720700

Maybe related, but I've also got a tailscale instance running for the same use case (on an older it box but...) I've also installed open-webui attached to ollama. That way the interface I can deal with when on my phone is just a simple nice to use webpage. May want to look into this? Thus far it's worked very slick.

jtbaker · 2025-01-26T04:57:37 1737867457

Nice, thanks for the suggestion. I got it set up just before leaving town for a few days, so have been doing a little tinkering with it. I was hoping to have a setup with LM Studio, where my laptop could use the API Server from the mini over the TS network. Unfortunately that doesn't seem to be the case, so I'll set up a configuration like you mentioned to just have a global client from any device on the network.

It's very cool to be able to have access to such a high horsepower machine from anywhere though. Next step is figuring out the networking interface to be able to access the host GPU/ollama API from pods running in a Colima VM/k3s cluster setup.
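For anyone wiring up the same thing, here is a minimal sketch of calling Ollama's HTTP API from another machine on the network, using only the Python standard library. The hostname `mac-mini` is a hypothetical tailnet name and `deepseek-r1:8b` is just the tag mentioned elsewhere in this thread; substitute your own. It assumes Ollama is listening on its default port 11434 and has been configured to accept non-local connections (by default it binds to localhost only; setting `OLLAMA_HOST=0.0.0.0` is the usual way to change that).

```python
import json
import urllib.request


def build_generate_request(host, model, prompt, port=11434):
    """Build a POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"http://{host}:{port}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(host, model, prompt):
    """Send the request and return the generated text (needs a reachable server)."""
    request = build_generate_request(host, model, prompt)
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.loads(response.read())["response"]


# "mac-mini" is a hypothetical tailnet hostname, not something from this thread.
request = build_generate_request("mac-mini", "deepseek-r1:8b", "Why is the sky blue?")
print(request.full_url)
```

Calling `generate("mac-mini", "deepseek-r1:8b", "...")` from a laptop or a pod should then work as long as that host name resolves and the port is reachable from wherever the code runs.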

pmarreck · 2025-01-21T00:40:46 1737420046

Time to put this 128GB M4 Mac to work...

    bash> ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF:Q5_K_M

(49GB!)

(It wouldn't accept the Q8_0 because it was sharded.)

Will report back lol

What app do you use on your phone to access your home LLM servers over tailscale?

sebws · 2025-01-21T09:00:40 1737450040

The tailscale app on iOS can turn on a VPN on the phone to connect into your tailnet. Once this is enabled, you can then access the tailnet IPs from anything, e.g., in your browser to hit a web ui. You can then take it further with local DNS and a reverse proxy to access your home server services at a URL.

simonw · 2025-01-21T04:24:20 1737433460

I'm running the Open WebUI Python web server app and accessing it on my phone via Mobile Safari.

Hasnep · 2025-01-21T04:43:03 1737434583

I use the GPTMobile app on Android to access ollama over Tailscale.

https://github.com/chocydar/GPTMobile

liamwire · 2025-01-21T02:37:03 1737427023

Not to speak for him, but I’d suspect Simon would be using a terminal emulator. He often mentions preferring to use his own command line tool, named LLM, as it allows one to log everything to an SQLite database for later exploration.
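To illustrate why logging to SQLite is handy, here is a minimal sketch of the idea using Python's built-in `sqlite3` module. This is not LLM's actual schema or code; the table name `prompt_log` and its columns are made up for the example.

```python
import sqlite3
from datetime import datetime, timezone


def open_log(path=":memory:"):
    """Open (or create) a prompt log. The schema here is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS prompt_log ("
        "id INTEGER PRIMARY KEY, logged_at TEXT, model TEXT, "
        "prompt TEXT, response TEXT)"
    )
    return db


def log_exchange(db, model, prompt, response):
    """Record one prompt/response pair with a UTC timestamp."""
    db.execute(
        "INSERT INTO prompt_log (logged_at, model, prompt, response) "
        "VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), model, prompt, response),
    )
    db.commit()


db = open_log()
log_exchange(db, "r1l", "a joke about a pelican and a walrus", "<think>...</think> ...")

# Later exploration is just SQL, e.g. find every prompt that mentions pelicans.
rows = db.execute(
    "SELECT model, prompt FROM prompt_log WHERE prompt LIKE ?", ("%pelican%",)
).fetchall()
print(rows)
```

Passing a file path instead of `":memory:"` keeps the log between runs, which is what makes the "later exploration" part possible.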

pmarreck · 2025-01-21T02:47:36 1737427656

Hah, I probably have a similar tool just called "ask" which logs to a JSON log instead of a sqlite db

pmarreck · 2025-01-21T02:48:40 1737427720

    bash> ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF:Q5_K_M
    pulling manifest
    ...
    success
    >>> Tell me a "what's the difference between" joke about techbros and another funny comparison of your choosing
    <think>
    Okay, so I need to come up with a "What's the difference between" joke comparing techbros and another group. Hmm. Let me think.

    First, what defines a techbro? They're usually associated with startups, tech gadgets, maybe working in Silicon Valley, into things like crypto, NFTs, and they often have a certain lingo or way of
    dressing—like hoodies, jeans, maybe beanies. They might be seen as overly optimistic about their ideas changing the world but sometimes disconnected from practical realities.

    Now, I need another group to compare them with. Maybe something else in tech but different enough to create humor. Let's think: developers? No, too similar. Maybe hipsters? Or perhaps venture capitalists?
    Wait, what about "developers who actually ship code"? That could work because it plays on the idea that while techbros talk a lot about their ideas, developers are more focused on delivering actual
    products.

    So the joke structure is usually two lines: one for each group, highlighting a key difference. It needs to be punchy and capture the essence of both groups in a humorous way.

    Let me draft it:

    "What's the difference between a techbro and a developer who actually ships code?"

    "Techbros are still explaining their MVP to you three years later, while developers just sent you a pull request titled 'it works now'."

    Wait, does that make sense? Let me break it down. A Minimum Viable Product (MVP) is something techbros often talk about but might take forever to actually build or ship. On the other hand, developers who
    are focused on shipping code would just send a pull request indicating they've fixed something and it's done now.

    I think that captures the difference between all talk and actual action. It's relatable for people in the tech industry who know the dynamics between idea people and implementers.

    Is there another angle? Maybe comparing them to something else, like entrepreneurs or investors, but I think the developer angle is solid because it directly contrasts talking about ideas versus executing
    them.

    So the final answer would be that joke. It highlights the techbro's tendency to over-explain without delivery, while developers focus on getting things done.
    </think>

    **What's the difference between a techbro and a developer who actually ships code?**

    *Techbros are still explaining their MVP to you three years later, while developers just sent you a pull request titled "it works now."*

    This joke humorously contrasts the tendency of techbros to focus on ideas and discussions with the action-oriented approach of developers who deliver tangible results.

thih9 · 2025-01-21T16:00:43 1737475243

IMO passable joke, not unlike modern content; if I saw it in the wild I’d laugh, feel called out and scroll down without thinking that AI may have written this.

m463 · 2025-01-23T02:24:51 1737599091

Funny, too much musing about the joke but then actually shipping it. must be from an AI, sort of an unfunny valley.

darthrupert · 2025-01-23T06:09:16 1737612556

The part between the "think" tags is (I guess) what ChatGPT-o1 hides under vague ux updates like "thinking about the problem". Perhaps the UI over ollama should hide them too.

Personally, I like to see the thought processes.
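For a UI that wants to offer both, a minimal sketch of separating the reasoning from the answer might look like this. It assumes the model wraps its reasoning in literal `<think>...</think>` tags, as in the transcripts in this thread; the function name is just illustrative.

```python
import re

# Non-greedy match so multiple <think> blocks are handled separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from raw model output."""
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


raw = "<think>\nOkay, so I need a joke. Let me think.\n</think>\n\nHere is the joke."
reasoning, answer = split_reasoning(raw)
print(answer)
```

A front end could then show `answer` by default and put `reasoning` behind a toggle.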

Mo3 · 2025-01-21T22:43:12 1737499392

> Let's think: developers? No, too similar

Ouch

dielll · 2025-01-21T04:10:53 1737432653

On Android use Termux.

It's a terminal emulator

matt_daemon · 2025-01-20T21:41:52 1737409312

> Here's what I got - the joke itself is rubbish but the "thinking" section is fascinating

Very funny. How does the 70B model do in comparison?

regularfry · 2025-01-21T13:18:42 1737465522

I've previously had a 70B Q2 wedged into a 4090. Not much space for context but it runs...

jondwillis · 2025-01-21T03:32:44 1737430364

I set up a cloudflare tunnel recently so I could also use my local LLMs remotely! It’s nice.

unstatusthequo · 2025-01-28T23:06:11 1738105571

Yes, I did the same but with Docker serving open-webui. It's definitely not fast. Took 2 minutes to answer "what is the highest peak in California," but it eventually got there. OpenAI o1 took 6 seconds.

Pasting the logic for anyone interested:

<details type="reasoning" done="true" duration="102">
<summary>Thought for 102 seconds</summary>

> Okay, so I need to figure out what the tallest peak in California is. Hmm, I remember that the highest mountain in the contiguous United States is Mount Whitney, but wait, isn't that in California? Let me think. Yes, I'm pretty sure it's in California. But maybe there are other tall peaks too.
>
> Wait, no, Mount Whitney is definitely the tallest. It's over 14,000 feet. But sometimes people might confuse it with other mountains like Shasta or something else. Oh, wait, isn't Mount Shasta also a volcano? I think that's in California too but it's not as tall as Whitney.
>
> So, putting it together, Mount Whitney is the tallest peak in California at 14,505 feet. It's located in the Sierra Nevada range. People hike there a lot because it's such a high point. But I should double-check to make sure I'm not mixing up anything else. Yeah, that seems right.
>
> I wonder if there are any taller peaks elsewhere. Oh, Denali in Alaska is higher, but that's not in California. So within the state, Whitney is the highest. Got it.
>
> So the answer is Mount Whitney at 14,505 feet above sea level. It's a popular spot for hikers and climbers. Makes sense because it's such an iconic peak in the Sierra Nevada mountains.
>
> I think I've got all that right. No other peaks in California are taller than Whitney. So yeah, Mount Whitney is the tallest.
>
> Just to make sure, sometimes there are different names or maybe alternate names for peaks, but Whitney is pretty straightforward. Yeah, it's definitely the one.
>
> Okay, so I'm confident now. The answer is Mount Whitney.
>
> *Final Answer*
>
> The tallest peak in California is Mount Whitney, with an elevation of 14,505 feet above sea level.
>
> \boxed{Mount Whitney}

</details>

To determine the tallest peak in California, we consider notable mountains such as Mount Whitney and Mount Shasta. Mount Whitney is recognized as the highest peak in the contiguous United States, standing at 14,505 feet. It is located in the Sierra Nevada mountain range and is a popular destination for hikers. Mount Shasta, although also a significant peak, does not surpass Whitney's elevation. Therefore, the tallest peak in California is confirmed to be Mount Whitney.

*Final Answer* The tallest peak in California is \boxed{Mount Whitney}.

peeters · 2025-01-21T03:56:40 1737431800

> Wait, maybe the punchline is something like: "We don’t have any fish in the tea, but we do have a lot of krill."

Shucks, it was so close to coming up with a good punchline it could work back from.

I'm thinking set it in a single-cell comic. A downtrodden young man or woman sitting alone at a table, a pelican in the background clearly making drinks in its voluminous beak, and the walrus waiter places a cup in front of the person, consolingly saying "there's plenty of fish in the tea".

joenot443 · 2025-01-21T09:39:45 1737452385

I chuckled, that's a very clean punchline. Nicely done.

These days I sometimes feel as if my skills as an iOS dev are losing market value. Maybe it's time I get back to writing standup :)

tough · 2025-01-22T07:15:46 1737530146

I had to make it an image

https://replicate.delivery/xezq/BfkZZEPS3CQHKa9HfOQTetyORSaW...

peeters · 2025-01-22T15:22:44 1737559364

Haha I was secretly hoping someone would, thanks! Is this your artwork or generated? Either way I'm impressed!

peeters · 2025-01-22T22:41:59 1737585719

Answering my own question - I think generated, because (took me WAY too long to notice) the dude has three arms.

tough · 2025-01-24T06:54:14 1737701654

lmao, yes its ai generated, custom FLUX model trained on a specific meme-style but i can actually use it for any drawings

there was also a slight 'editing by curating as i choose one of the 4 outputs i got

being generated more obv on the others (i think i missed that third arm too lmao)

https://replicate.delivery/xezq/NkTMmG5ZLrIpLhr38zNR6RMeN4C2...

https://replicate.delivery/xezq/lzZSIkCGnXKFI90Rw3ANFwOvrZBn...

noticed a typo on my first prompt had to rerun without

https://replicate.delivery/xezq/7MU7JhaNyiZlN18gXZrfPkn9ZMgG...

https://replicate.delivery/xezq/d5ELWfhSNk2REqtFe6fn1ouUIf5n...

https://replicate.delivery/xezq/YQwQ6tMLQKZ2LlwkXdBu1OUR9Yue...

https://replicate.delivery/xezq/VkfUjywopD0SEim4uERM22oetff3...

scotty79 · 2025-01-21T19:52:03 1737489123

"Meanwhile, the pelican tries to help by filtering it with its webbed feet."

I think it doesn't quite get what webbed feet mean.

hoseja · 2025-01-22T09:16:46 1737537406

It is a weird term. Why not paddle feet. "Web" for a membrane makes little sense.

fud101 · 2025-01-21T07:05:24 1737443124

but we do have a lot of time to krill would have worked better. I saw it instantly from your post, not sure if it's funny or not though.

richardw · 2025-01-21T08:54:51 1737449691

While not world class, both of those are instantly better than the bag ‘o tokens served by the LLM.

Honestly it’s a bit like the porn definition. We know intelligence when we see it, even if it’s hard to define. Adding more tokens doesn’t magically phase-change into real smarts. We’ll get there, but not today.

patcon · 2025-01-21T19:05:41 1737486341

And yet none of it would have ever been experienced without the guidance toward this thinking by an LLM... ;)

Maybe future LLMs will be their most pro-social by playing stupid, and letting us feel like we came up with all the good ideas they gracelessly-but-gracefully point our attention toward (not saying that's happening here, but maybe it's where we should make it stop)

HarHarVeryFunny · 2025-01-20T17:13:10 1737393190

I think the problem is that humor isn't about reasoning and logic, but almost the reverse - it's about punchlines that surprise us (i.e. not what one would logically anticipate) and perhaps shock us by breaking taboos.

Even masters of humor like Seinfeld, with great intuition for what might work, still need to test new material in front of a live audience to see whether it actually does get a laugh or not.

ilaksh · 2025-01-21T07:42:28 1737445348

Good point, but this is an 8GB model. That's the problem. This is a spectacular answer given that constraint. Also it's hilarious how bad the final answer is.

HarHarVeryFunny · 2025-01-21T14:24:08 1737469448

Yes - impressive how good the small models are getting, and this "reasoning distillation" seems to have given them a significant boost.

Even though humor is largely about the unanticipated punchline, I'd have guessed (maybe wrongly) that there'd be enough analytical discussion of humor in the training set for a reasoning model to come up with a much more plausible attempt at a formulaic type of joke.

From the example given it seems there's too much "thought" put into "what do I have to work with here", and not enough into conceiving/selecting a template for the joke. Maybe part of the problem is that the LLM doesn't realize that, being an LLM, its best chance at being funny to a human is to closely stick to a formula that humans find funny, and not try to be too smart in trying to deconstruct it.

iamnotagenius · 2025-01-21T18:26:42 1737484002

small models can be funny (12b model):

tell me funny story about techbro and real developer who actually ships code

Sure, here's a light-hearted story for you:

---

Once upon a time in Silicon Valley, there was a shiny new startup called "TechBro Inc." It was filled with eager young techbros, all wearing their "10x Developer" t-shirts, and sipping on their $8 cold-pressed juices. Among them was Chad, the self-proclaimed "Tech Bro King."

Chad was always posting his impressive code snippets on his social media profiles, bragging about how he could "crush" any coding challenge in no time. He was the epitome of a techbro, all talk and no walk.

Meanwhile, in the corner, there was Alex, the quiet, unassuming developer who actually shipped code. Alex didn't have flashy t-shirts or a massive online following, but the company's products worked, and that was thanks to Alex.

One day, TechBro Inc. decided to host a "Hackathon" to impress their investors. Chad saw this as his opportunity to shine. He spent the entire day tweeting about his progress, posting screenshots of his code, and even live-streaming his "epic coding session."

Alex, on the other hand, quietly sat at his desk, coding away. He didn't take any breaks for photoshoots or live streams. He was there to work, not to perform.

As the hackathon came to a close, Chad confidently declared, "I've just finished the most amazing feature ever! It's going to revolutionize our app!" The room erupted in applause, and Chad basked in the glory.

The next day, the team gathered to demo their new features. Chad confidently presented his "revolutionary" feature. He clicked the button to launch it, but nothing happened. He tried again, and again, but the app remained unchanged. Chad's face turned red, and sweat dripped down his forehead.

It was Alex's turn. He quietly stepped up, clicked a button, and suddenly, the app was filled with new, useful features. The team gasped in awe. The investors were impressed. The app actually worked, and that was thanks to Alex.

Chad, trying to save face, turned to Alex and said, "Well, Alex, it seems you've just... shipped some code. Not bad for a... real developer."

The room burst into laughter. From that day forward, Chad learned the difference between talking about coding and actually doing it. And Alex? He just kept shipping code, one line at a time.

---

valval · 2025-01-22T14:54:12 1737557652

That is beyond terrible. Are you joking? I feel sad that I spent a minute reading that.

simonw · 2025-01-20T17:28:04 1737394084

Yeah, getting LLMs to invent jokes is an incredibly unfair test of them - but I derive enormous entertainment from seeing them try.

exhaze · 2025-01-20T21:52:45 1737409965

LLMs output can often be like a prism in front of a mirror - a fusion of humanity with the specific human, reflected back at the human.

Simon, perhaps you're just not funny.

kridsdale1 · 2025-01-21T08:21:23 1737447683

How Nietzschean

HarHarVeryFunny · 2025-01-20T17:38:16 1737394696

I just tried Claude Sonnet with the Pelican & Walrus setup, but asked it for something in the style of Norm Macdonald's humor, which would seem a potentially good fit for this type of idea. It got the idea of a rambling story (cf Macdonald's moth joke) that mostly ignored them being a Pelican and Walrus, which seemed promising, but still failed to deliver. I'm guessing with some more guidance and iteration it could have come up with something.

astrange · 2025-01-20T21:23:49 1737408229

I've been trying out "write and perform an MST3K episode reviewing your previous answer" as a prompt. Have yet to get a funny answer from any model.

Older base models produce some great stuff though, at least if you're me and think RoboRosewater and dril_gpt2 are funny.

stavros · 2025-01-20T23:11:52 1737414712

I've found that the weaker the model, the more of an unhinged, LSD-trip response you get. I made a podcast years ago (https://deepdreams.stavros.io) and GPT-2 was definitely the more hilarious of the models.

Chris2048 · 2025-01-27T20:33:21 1738010001

> Have yet to get a funny answer from any model.

Just like the real thing! Promising results!

MarcelOlsz · 2025-01-21T00:28:39 1737419319

I'd love to see a standup competition of trying to make AI jokes/sets funny.

darthrupert · 2025-01-23T06:15:48 1737612948

All this reminds me of another genius comedy duo, David Lynch and Louis CK: https://m.youtube.com/watch?v=HlEJbs02wAM

Come on, LLM, be funny! 3 2 1 go!

momojo · 2025-01-20T18:36:22 1737398182

> the joke itself is rubbish but the "thinking" section is fascinating:

This is gold. If I was a writer, I'd wring value from that entire thinking-out-loud section and toss the actual punchline.

This is weirdly reminiscent of co-programming with CodyAI. It gives me a lot of good 'raw material' and I'm left integrating the last mile stuff.

BHSPitMonkey · 2025-01-21T06:50:02 1737442202

I like to imagine the entire transcript as a Norm Macdonald bit, a la his moth joke

_sys49152 · 2025-01-20T19:23:39 1737401019

looking at how it thinks, it can help bust you out when you write yourself into a corner.

monkeydust · 2025-01-20T15:50:22 1737388222

Thanks! Playing around with this vs the https://ollama.com/tripplyons/r1-distill-qwen-7b variant and find 7b to be somewhat of a sweet spot of getting to the point with minimal (or less) waffle.

Certainly, interesting reading their thought processes, value in that might be greater than the answer itself depending on use-case.

wat10000 · 2025-01-20T18:28:05 1737397685

This joke is so terrible, I think this might end up being how AI kills us all when it decides it needs us out of the way to make more paperclips.

stavros · 2025-01-20T23:07:39 1737414459

Wait until you see what I got when prompting it for a joke about three priests:

> Why did the rabbi become a boxer?

> Because he couldn't handle the holy punches!

> But wait, that's not quite right. Let me try again.

> Here’s one: What do you call a Jewish, Orthodox, and Catholic priest working together?

> A trio of men who probably shouldn’t be telling jokes after their day job!

> Sorry for the confusion earlier—I hope this one makes more sense!

It did not.

wat10000 · 2025-01-21T00:06:14 1737417974

That second one isn’t funny, but it’s still interesting somehow.

Actually, it would work as a joke, but it would have to be told by an actual priest hanging out after work with two other actual priests. Ultra-niche humor.

rererereferred · 2025-01-21T14:01:44 1737468104

> What do you call a Jewish, Orthodox, and Catholic priest working together?

A miracle.

widdershins · 2025-01-20T15:39:15 1737387555

Yeesh, that shows a pretty comprehensive dearth of humour in the model. It did a decent examination of characteristics that might form the components of a joke, but completely failed to actually construct one.

I couldn't see a single idea or wordplay that actually made sense or elicited anything like a chuckle. The model _nearly_ got there with 'krill' and 'kill', but failed to actually make the pun that it had already identified.

samdjstephens · 2025-01-20T15:56:38 1737388598

Yeah it's very interesting... It appears to lead itself astray: the way it looks at several situational characteristics, gives each a "throw-away" example, only to then mush all those examples together to make a joke seems to be its downfall in this particular case.

Also I can't help but think that if it had written out a few example jokes about animals rather than simply "thinking" about jokes, it might have come up with something better

laweijfmvo · 2025-01-21T00:26:47 1737419207

why shouldn’t i assume that the “thinking” is just the usual LLM regurgitation of “how would a human coming up with a joke explain their reasoning?” or something like that, and zero “thinking”?

wrsh07 · 2025-01-21T02:48:22 1737427702

Given how it is trained specifically (they didn't encourage it to think, they allowed it to) there was a lot of emergent behavior as it trained.

Sort of like chess engines rediscovering classic (named) chess openings. See section 2.2.3 for the training template (it's a single paragraph I can't reproduce here because I'm on my phone)

Example emergent behavior (section 2.2.4 page 8): the model learns to solve more complex problems by spending more time reasoning. It also naturally develops reflection (what have I tried?) and exploration strategies.

Fundamentally, you should think of this as a nn that learned to solve real problems by reasoning about them in written language.

(My favorite part: it defaulted to reasoning in multiple languages. They constrained it to only reason in a single language and this negatively impacted performance! But the hypothesis is that it improves interpretability)

mordae · 2025-01-21T07:27:27 1737444447

Ever read philosophy? An acquaintance can and will readily mix and match 3 languages to obtain more precision.

barrkel · 2025-01-21T00:58:29 1737421109

You shouldn't assume that because you shouldn't assume the converse either.

The LLM will use the chain of thought, as it's being built out like a garden path, to explore different completion possibilities. It's not necessarily logically related what it completes afterwards, but it's definitely influenced.

The search may not be valuable as an artifact in itself. It's likely to be logically unsound especially in parts. But the end result may be useful.

brookst · 2025-01-21T00:38:20 1737419900

What’s the difference?

bitwalker · 2025-01-21T00:47:23 1737420443

At least to me, the difference is that one is ostensibly an explanation of how the AI arrived at the joke, the other is a post-hoc explanation of the joke.

gpm · 2025-01-21T00:55:14 1737420914

You can be pretty sure the AI isn't doing a post-hoc explanation because the only writable memory it has access to is the tokens it has already output - i.e. the explanation of the joke. Everything else is reset between every token.

As long as it comes up with different jokes different times you ask it (assuming randomness in sampling) - how could it?

baq · 2025-01-21T06:26:09 1737440769

The problem is it can’t remember what it hasn’t written but the end result still makes sense, so there has to be some goal after parsing the initial context that the tokens are emitted towards. This means there’s nothing stopping it from producing an explanation, it might be in there from the very start.

gpm · 2025-01-21T19:25:50 1737487550

Not one goal though, but a set of goals. Otherwise the random sampling of tokens wouldn't result in it outputting meaningfully different jokes.

I also think it's safe to assume that the set of goals isn't fully resolved, but it's harder to "prove" that.

svachalek · 2025-01-21T15:20:59 1737472859

There's no goal. The tokens are computed one at a time from all the previous tokens.
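The loop being argued about here can be sketched with a toy example. This is not how any real LLM is implemented: the "model" below is a hypothetical lookup table that only looks at the last token, whereas a real model conditions on all previous tokens. The point is just the shape of the loop: the only state carried from step to step is the token list, and random sampling is what makes different runs diverge.

```python
import random

# Hypothetical toy "model": maps the current context to next-token probabilities.
TOY_TABLE = {
    "<start>": {"the": 1.0},
    "the": {"pelican": 0.5, "walrus": 0.5},
    "pelican": {"sips": 0.6, "<end>": 0.4},
    "walrus": {"sips": 0.6, "<end>": 0.4},
    "sips": {"tea": 1.0},
    "tea": {"<end>": 1.0},
}


def next_token_probs(tokens):
    """Pure function of the tokens so far (here: only the last one, for brevity)."""
    return TOY_TABLE[tokens[-1]]


def generate(rng, max_tokens=10):
    """Sample one token at a time; the token list is the only memory."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        probs = next_token_probs(tokens)
        choice = rng.choices(list(probs), weights=list(probs.values()))[0]
        tokens.append(choice)
        if choice == "<end>":
            break
    return tokens


# Different seeds stand in for the randomness in sampling.
samples = {" ".join(generate(random.Random(seed))) for seed in range(20)}
for sample in sorted(samples):
    print(sample)
```

Whether anything like a "goal" exists inside `next_token_probs` for a real model is exactly the open question in this subthread; the sketch only shows that nothing outside the token list persists between steps.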

baq · 2025-01-21T16:33:21 1737477201

One is orthogonal to the other.

croemer · 2025-01-20T22:11:43 1737411103

Can someone ELI5 what the difference is between using the "quantized version of the Llama 3" from unsloth instead of the one that's on ollama, i.e. `ollama run deepseek-r1:8b`?

whereismyacc · 2025-01-20T22:24:04 1737411844

The weights are quantized down to fewer bits in order to save on memory. The quantization loss is going to result in worse generations.
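As a rough illustration of where the loss comes from, here is a minimal sketch of symmetric 8-bit block quantization in plain Python. It is similar in spirit to formats like Q8_0 (a block of weights stored as small integers plus one shared scale) but it is a simplification, not the actual GGUF/llama.cpp code, and the example weights are made up.

```python
def quantize_block(weights):
    """Store a block as one float scale plus integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quants = [round(w / scale) for w in weights]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate weights from the scale and the integers."""
    return [q * scale for q in quants]


# Made-up example weights for one block.
block = [0.12, -0.5, 0.33, 0.0071, -0.25, 0.49, -0.031, 0.2]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)

# Each weight can be off by up to half a quantization step.
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(quants)
print(max_error)
```

With fewer bits per weight the step size grows, so the rounding error per weight grows too, which is why lower-bit quants trade quality for memory.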

ColonelPhantom · 2025-01-20T22:37:23 1737412643

Ollama serves multiple versions, you can get Q8_0 from it too:

    ollama run deepseek-r1:8b-llama-distill-q8_0

The real value from the unsloth ones is that they were uploaded before R1 appeared on Ollama's model list.

AS04 · 2025-01-21T02:26:54 1737426414

Unsloth also works very diligently to find and fix tokenizer issues and many other problems as soon as they can. I have comparatively little trust in ollama following up and updating everything in a timely manner. Last I checked, there is little information on when the GGUFs etc. on ollama were updated or what llama.cpp version / git commit they used for it. As such, quality can vary and be significantly lower with the ollama versions for new models I believe.

dragonwriter · 2025-01-21T16:05:44 1737475544

They are probably the same model. unsloth does model quants and provides them to the community; AFAIK ollama doesn't, they just index publicly available models, whether full or quantized, for convenient use in their frontend.

reissbaker · 2025-01-20T15:36:53 1737387413

FWIW, you can also try all of the distills out in BF16 on https://glhf.chat (either in the UI or via the API), including the 70b. Personally I've been most impressed with the Qwen 32b distill.

(Disclosure: I'm the cofounder)

reissbaker · 2025-01-20T18:52:50 1737399170

(And here's the output of the "pelican and the walrus in a tearoom" joke from DeepSeek's distilled 70b... Mildly better?)

<think> Alright, I'm supposed to come up with a joke about a pelican and a walrus in a tea room. Hmm, sounds like a classic setup for a funny situation. Let me think about the characteristics of these animals that could be used humorously. Okay, so pelicans are known for their large beaks, right? They can carry a lot of water in them. Walruses have those big tusks and blubbery bodies. Maybe I can play on the idea of them trying to have a tea party but their physical traits causing some trouble.

Wait, being in a tea room, they might be trying to sip tea, but their size or features get in the way. The pelican's beak could be problematic with the cup. And the walrus's size might knock things over or make a mess.

Let me consider the flow. Maybe the pelican struggles with the teacup because of his beak, and the walrus has a comment about it. Then the walrus tries to sip but ends up splashing tea everywhere. That could lead to a funny exchange.

Putting it together: Pelican says something self-deprecating about his beak not being suited for tea. Then the walrus, who perhaps agrees, tries to sip and splashes tea, making them both laugh. That feels natural and funny.

I should make sure the punchline is clear and the dialogue flows smoothly. Maybe start with the pelican struggling, then the walrus makes a comment, and their interaction leads to the humorous conclusion. Yeah, that should work. </think>

Why did the pelican and the walrus get kicked out of the tea room?

Because the pelican kept dipping his beak in the wrong cup, and the walrus kept breaking the delicate china with his tusks. They just couldn’t “sea” themselves fitting in!

jpc0 · 2025-01-22T15:00:04 1737558004

Apparently I'm a sucker for bad jokez...

It wasn't that good but the punchline actually got me a giggle...

oidar · 2025-01-21T03:06:58 1737428818

Might want to let people know that you have to pay to try things ahead of time. It's not obvious when on the launch page.

reissbaker · 2025-01-21T05:22:55 1737436975

The launch page has a pretty prominent pricing link — what would make it more obvious?

oidar · 2025-01-21T05:41:40 1737438100

Something about pricing above the fold: https://ibb.co/rFR41rK

reissbaker · 2025-01-21T08:28:05 1737448085

Good point! Thank you for the screenshot :)

gjm11 · 2025-01-21T02:36:51 1737427011

What's your sense of how useful local LLMs are for things other than ... writing blog posts about experimenting with local LLMs? :-)

(This is a serious question, not poking fun; I am actually curious about this.)

simonw · 2025-01-21T04:29:45 1737433785

Six months ago I had almost given up on local LLMs - they were fun to try but they were so much less useful than Sonnet 3.5 / GPT-4o that it was hard to justify using them.

That's changed in the past two months. Llama 3 70B, Qwen 32B and now these R1 models are really impressive, to the point that I'm considering trying to get real work done with them.

The catch is RAM: I have 64GB, but loading up a current GPT-4 class model uses up around 40GB of that - which doesn't leave much for me to run Firefox and VS Code.

So I'm still not likely to use them on a daily basis - but it does make me wonder if I should keep this laptop around as a dedicated server next time I upgrade.

gjm11 · 2025-01-21T17:33:05 1737480785

Thanks!

One reason why I'm asking is that I'm in the market for a new laptop and am wondering whether it's worth spending more for the possible benefits of being able to run ~30-40GB local LLMs.

Unfortunately it doesn't look as if the answer is either "ha ha, obviously not" or "yes, obviously". (If the question were only about models available right now I think the answer would be no, but it seems like they're close enough to being useful that I'm reluctant to bet on them not being clearly genuinely useful a year from now.)

simonw · 2025-01-21T17:41:22 1737481282

Yeah, it's not an obvious answer at all. Spending ~$3,000+ on a laptop to run local models is only economically sensible if you are VERY paranoid about using APIs (there are plenty of API providers that I personally trust not to train on my data) - otherwise that $3,000 will buy you many years worth of access to the best available models via API.

gjm11 · 2025-01-21T19:15:31 1737486931

Well, I unfortunately have expensive tastes in laptops anyway, so the delta is substantially less than $3k, and it's possible that from time to time I'll run across other things that benefit from the fancier machine, and if I don't get a 64GB Mac one of the other possibilities is a 48GB Mac which would still be able to run some local LLMs. But, all that said, it's still potentially a sizable chunk of money for a dubious benefit.

I've been assuming that privacy isn't the only benefit of local; it seems like a local model would offer more flexibility for fine-tuning, RAG, etc., though I am completely ignorant of e.g. what size of model it's actually feasible to do any useful fine-tuning to on given hardware.

jhonof · 2025-01-21T02:40:39 1737427239

If you are worried about security or IP at all, it's preferable to run locally, or spin up your own box that you can query running one of these models.

gjm11 · 2025-01-21T17:27:20 1737480440

I understand the reasons for not wanting to use a remote LLM. My question was about how useful local LLMs are. It might turn out that for some people remote LLMs are unacceptable for privacy reasons and local LLMs are unacceptable because they aren't good enough to be useful.

TeMPOraL · 2025-01-20T22:09:28 1737410968

Did you try the universal LLM cheat code as a followup prompt?

"Make it better"

svachalek · 2025-01-21T15:23:25 1737473005

These "thinking" LLMs are essentially productizing that trick. All the rambling and second guessing they go through at the beginning gives them source material to compute the "better" answer from at the end.

TeMPOraL · 2025-01-21T21:38:43 1737495523

Sure, but you can still re-prompt them again telling them to just do better.

In case people missed it, I'm referencing an observation recently made by 'minimaxir, described here:

https://minimaxir.com/2025/01/write-better-code/

As it turns out, you can improve the quality of code generated by some LLMs by repeatedly responding to it with just three words: "write better code".

While Max only tested this on Claude 3.5 Sonnet, I see no reason why this wouldn't work with the "thinking" models either. Even if it doesn't the results might still be interesting. With that in mind, here's the article's experiment applied to o1-preview:

https://cloud.typingmind.com/share/69e62483-45a4-4378-9915-6...

Eyeballing the output, it seems to align with the article's observation.

(o1-preview is the only "thinking" model I currently have API access to; official ChatGPT app doesn't let me reply to o1 family, forcing further interactions to be with "non-thinking" models instead.)

lmc · 2025-01-20T16:02:45 1737388965

> The walrus might say something like, "We have the biggest catch in town," while the pelican adds a line about not catching any fish recently.

It should've stopped there :D

earth2mars · 2025-01-21T01:25:25 1737422725

Tried exactly the same model. And unfortunately the reasoning is just useless. But it is still not able to tell how many r's are in strawberry.

spacemanspiff01 · 2025-01-21T01:31:32 1737423092

That's a tokenizer issue though?

zamadatix · 2025-01-21T01:50:27 1737424227

Not 100% so for chain of thought models, they should recognize to spell the word letter by letter in some separated form and then count the tokens in that form. The Qwen distill seems to do exactly this really well:

> Step-by-step explanation:

> 1. Break down each word: "not", "really", "a", "tokenizer", "issue".

> 2. Count 'e's in each word:

> - "not": 0

> - "really": 1

> - "a": 0

> - "tokenizer": 2

> - "issue": 1

> 3. Sum the counts: 0 + 1 + 0 + 2 + 1 = 4.

>

> Answer: There are 4 E's in the phrase.

In the thought portion it broke the words up every which way you could think to check then validated the total by listing the letters in a number list by index and counting that compared to the sums of when it did each word.
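For reference, the ground truth behind the counts quoted above can be checked directly on characters rather than tokens. This is a minimal Python sketch; the helper names `count_per_word` and `positions` are made up for illustration, and it shows only the answer the model is trying to reach, not how a model computes it.

```python
def count_per_word(phrase: str, letter: str) -> dict[str, int]:
    """Count occurrences of `letter` in each whitespace-separated word (case-insensitive)."""
    return {word: word.lower().count(letter.lower()) for word in phrase.split()}


def positions(word: str, letter: str) -> list[int]:
    """Return the 1-based positions at which `letter` occurs in `word`."""
    return [i for i, ch in enumerate(word.lower(), start=1) if ch == letter.lower()]


per_word = count_per_word("not really a tokenizer issue", "e")
print(per_word, sum(per_word.values()))   # per-word breakdown, then the total
print(positions("strawberry", "r"))       # index-by-index check
```

The per-word sum and the index listing are two independent routes to the same count, which is the cross-check described in the comment above.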

spacemanspiff01 · 2025-01-21T11:10:46 1737457846

But the only way to do this is if it is trained on how to map the word token to character tokens, i.e.

    Hello -> h e l l o
    66547 -> 12 66 88 88 3

Or, maybe it memorized that hello has a single e.

Either way, this seems to be an edge case that may or may not exist in the training data, but seems orthogonal to 'reasoning'.

A better test case would be how it performs if you give the spelling mappings for each word in the context?

zamadatix · 2025-01-21T16:22:22 1737476542

"Be trained how to map" implies someone is feeding in a list of every token and what the letters for that token are as training data and then training that. More realistically, this just happens automatically during training as the model figures out what splits work with which tokens because that answer was right when it came across a spelling example or question. The "reasoning" portion comes into play by its ability to judge whether what it's doing is working rather than go with the first guess. E.g. feeding "zygomaticomaxillary" and asking for the count of 'a's gives a CoT

> <comes to an initial guess>

> Wait, is that correct? Let me double-check because sometimes I might miscount or miss letters.

> Maybe I should just go through each letter one by one. Let's write the word out in order:

> <writes one letter per line with the conclusion for each>

> *Answer:* There are 3 "a"s in "zygomaticomaxillary."

It's not the only example of how to judge a model but there are more ways to accurately answer this problem than "hardcode the tokenizer data in the training" and heavily trained CoT models should be expected to hit on at least several of these other ways or it is suspect they miss similar types of things elsewhere.

svachalek · 2025-01-21T15:28:36 1737473316

The important thing is, when we're all replaced by robots, deep down we will know we are superior because we can count letters in strawberry.

ryanisnan · 2025-01-20T20:22:07 1737404527

Super interesting. It seems to get hung up on a few core concepts, like the size of the walrus vs. the limited utility of a pelican beak.

jonplackett · 2025-01-21T00:06:08 1737417968

This is probably pretty similar to my inner monologue as I would try and inevitably fail to come up with a good joke.

newman314 · 2025-01-21T19:14:15 1737486855

Have you had a chance to compare performance and results between the Qwen-7B and Llama-8B versions?

riwsky · 2025-01-21T05:02:42 1737435762

“I never really had a childhood”, said Walrus, blowing on his tea with a feigned sigh. “Why’s that?” asked Pelican, refilling a sugar shaker. Walrus: “I was born long in the tooth!” Pelican: [big stupid pelican laughing noise]

dcreater · 2025-01-21T00:12:52 1737418372

Why ask it for a joke? That's such a bad way to try out a reasoning model

marci · 2025-01-21T10:59:36 1737457176

Would that be funny? Failing a job interview because, when asked to make a joke about a pelican and a walrus who run a tea room together, you couldn't come up with a quip, therefore showing your inability to reason.

Let's not even talk about the "r" you forgot when asked to write "cranberry"...

Weetile · 2025-01-21T00:46:37 1737420397

Creating a joke is a great demonstration of a multistage reasoning/deduction process

HeatrayEnjoyer · 2025-01-21T00:21:58 1737418918

It's a great test. Creating a joke that is both funny and novel requires many steps.

simonw · 2025-01-21T00:21:43 1737418903

Watching them attempt to reason their way to a joke is really amusing.

qqqult · 2025-01-21T06:49:30 1737442170

I asked deepseek-14b for a joke in the style of Norm Macdonald, and it delivered an overexplained deadpan absurd joke that I thought was pretty similar to Norm's style. Then I tried again with different styles and realized that all jokes it tries to make are in that same style, as if it tries to follow a manual on how to construct a joke without actually being funny. Which is pretty funny

fsndz · 2025-01-21T08:26:03 1737447963

frankly ollama + Deepseek is all you need to win with open source AI. I will do some experiments today and add it to my initial blogpost. https://medium.com/thoughts-on-machine-learning/deepseek-is-...

linsomniac · 2025-01-20T15:40:46 1737387646

>a joke about a pelican and

Tell me you're simonw without telling me you're simonw...

tomrod · 2025-01-20T23:28:20 1737415700

Can you recommend hardware needed to run these?

simonw · 2025-01-21T00:33:31 1737419611

I'm using an M2 64GB MacBook Pro. For the Llama 8B one I would expect 16GB to be enough.

I don't have any experience running models on Windows or Linux, where your GPU VRAM becomes the most important factor.

dragonwriter · 2025-01-21T16:08:25 1737475705

On Windows or Linux you can run from RAM or split layers between RAM and VRAM; running fully on GPU is faster than either of those, but the limit on what you can run at all isn’t VRAM.

akhdanfadh · 2025-01-22T14:24:11 1737555851

So is it possible to load the ollama deepseek-r1 70b (43gb) model on my 24gb vram + 32gb ram machine? Does this depend on how I load the model, i.e., with ollama instead of other alternatives? Afaik, ollama is basically llama.cpp wrapper.

I have tried to deploy one myself with openwebui+ollama but only for small LLM. Not sure about the bigger one, worried if that will crash my machine someway. Are there any docs? I am curious about this and how that works if any.
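A rough way to sanity-check the question above is plain arithmetic on memory sizes. The sketch below is Python with a made-up helper, `split_across_memory`, and a hypothetical 4 GB reserve for the OS and other programs. It ignores the KV cache and context length, which also need memory, so treat it as a lower bound on requirements rather than a guarantee.

```python
def split_across_memory(model_gb: float, vram_gb: float, ram_gb: float,
                        ram_reserve_gb: float = 4.0):
    """Rough feasibility check for splitting model weights between VRAM and system RAM.

    ram_reserve_gb is a hypothetical allowance for the OS and other programs.
    KV cache and context length are ignored here.
    """
    gpu_part = min(model_gb, vram_gb)      # put as much as possible on the GPU
    cpu_part = model_gb - gpu_part         # the remainder spills into system RAM
    fits = cpu_part + ram_reserve_gb <= ram_gb
    return fits, gpu_part, cpu_part


fits, gpu_part, cpu_part = split_across_memory(model_gb=43, vram_gb=24, ram_gb=32)
print(fits, gpu_part, cpu_part)
```

By this arithmetic the 43 GB model would fit on a 24 GB VRAM + 32 GB RAM machine, with about 19 GB of weights in system RAM. Real-world speed and headroom depend on the runtime and the context size.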

rane · 2025-01-21T09:42:51 1737452571

Why isn't GPU VRAM a factor on a Silicon mac?

Nekit1234007 · 2025-01-21T10:34:29 1737455669

Because there's no VRAM. The “regular” RAM on Apple Silicon devices is shared with the GPU virtually in its entirety.

yencabulator · 2025-01-21T16:49:45 1737478185

ollama runs deepseek-r1:7b on AMD 8945HS, CPU-only, at ~12 tokens/s. You can get started pretty easily in the ~7B model range, for learning purposes.

fpgaminer · 2025-01-20T23:55:54 1737417354

I think "reasoning" models will solve the joke issue (amongst other issues), but not because they're "reasoning". Rather because they help solve the exploration issue and the scaling issue.

Having worked with LLMs a lot for my JoyCaption project, I've got all these hypotheses floating around in my head. I guess the short version, specifically for jokes, is that we lack "joke reasoning" data. The solution, like mathematical problems, is to get the LLM to generate the data and then RL it into more optimal solutions.

Longer explanation:

Imagine we want an LLM to correctly answer "How many r's are in the word strawberry?". And imagine that language has been tokenized, and thus we can form a "token space". The question is a point in that space, point Q. There is a set of valid points, set A, that encompasses _any_ answer to this question which is correct. There are thus paths through token space from point Q to the points contained by set A.

A Generator LLM's job is, given a point, predict valid paths through token space. In fact, we can imagine the Generator starting at point Q and walking its way to (hopefully) some point in set A, along a myriad of inbetween points. Functionally, we have the model predict next token (and hence point in token space) probabilities, and we can use those probabilities to walk the path.

An Ideal Generator would output _all_ valid paths from point Q to set A. A Generator LLM is a lossy compression of that ideal model, so in reality the set of paths the Generator LLM will output might encompass some of those valid paths, but it might also encompass invalid paths.

One more important thing about these paths. Imagine that there is some critical junction. A specific point where, if the Generator goes "left", it goes into a beautiful flat, grassy plain where the sun is shining. That area is really easy to navigate, and the Generator LLM's predictions are all correct. Yay! But if it goes "right" it ends up in the Fire Swamp with many dangers that it is not equipped to handle. i.e. it isn't "smart" enough in that terrain and will frequently predict invalid paths.

Pretraining already taught the Generator LLM to avoid invalid paths to the best of its abilities, but again its abilities are limited.

To fix this, we use RL. A Judge LLM takes a completed path and determines if it landed in the set A or not. With an RL algorithm and that reward signal, we can train the Generator LLM to avoid the Fire Swamp, since it often gets low rewards there, and instead goes to the Plain since it often gets rewards there.

This results in a Generator LLM that is more _reliable_ and thus more useful. The RL encourages it to walk paths it's good at and capable of, avoid paths it struggles with, and of course encourages valid answers whenever possible.

But what if the Generator LLM needs to solve a really hard problem. It gets set down at point Q, and explores the space based on its pretraining. But that pretraining _always_ takes it through a mountain and it never succeeds. During RL the model never really learns a good path, so these tend to manifest as hallucinations or vapid responses that "look" correct.

Yet there are very easy, long paths _around_ the mountain that get to set A. Those don't get reinforced because they never get explored. They never get explored because those paths weren't in the pretraining data, or are so rare that it would take an impractical amount of exploration for the PT model to output them.

Reasoning is one of those long, easy paths. Digestible small steps that a limited Generator LLM can handle and use to walk around the mountain. Those "reasoning" paths were always there, and were predicted by the Ideal Generator, but were not explored by our current models.

So "reasoning" research is fundamentally about expanding the exploration of the pretrained LLM. The judge gets tweaked slightly to encourage the LLM to explore those kinds of pathways, and/or the LLM gets SFT'd with reasoning data (which is very uncommon in its PT dataset).

I think this breakdown and stepping back is important so that we can see what we're really trying to do here: get a limited Generator LLM to find its way around areas it can't climb. It is likely true that there is _always_ some path from a given point Q to set A that a limited Generator LLM can safely traverse, even if that means those paths are very long.

It's not easy for researchers to know what paths the LLM can safely travel. So we can't just look at Q and A and build a nice dataset for it. It needs to generate the paths itself. And thus we arrive at Reasoning.

Reasoning allows us to take a limited, pretrained LLM, and turn it into a little path finding robot. Early during RL it will find really convoluted paths to the solution, but it _will_ find a solution, and once it does it gets a reward and, hopefully, as training progresses, it learns to find better and shorter paths that it can still navigate safely.

But the "reasoning" component is somewhat tangential. It's one approach, probably a very good approach. There are probably other approaches. We just want the best ways to increase exploration efficiently. And we're at the point where existing written data doesn't cover it, so we need to come up with various hacks to get the LLM to do it itself.

The same applies to jokes. Comedians don't really write down every single thought in their head as they come up with jokes. If we had that, we could SFT existing LLMs to get to a working solution TODAY, and then RL into something optimal. But as it stands PT LLMs aren't capable of _exploring_ the joke space, which means they never come out of the RL process with humor.

Addendum:

Final food for thought. There's kind of this debate going on about "inference scaling", with some believing that CoT, ToT, Reasoning, etc are all essentially just inference scaling. More output gives the model more compute so it can make better predictions. It's likely true that that's the case. In fact, if it _isn't_ the case we need to take a serious look at our training pipelines. But I think it's _also_ about exploring during RL. The extra tokens might give it a boost, sure, but the ability for the model to find more valid paths during RL enables it to express more of its capabilities and solve more problems. If the model is faced with a sheer cliff face it doesn't really matter how much inference compute you throw at it. Only the ability for it to walk around the cliff will help.

And, yeah, this all sounds very much like ... gradient descent :P and yes there have been papers on that connection. It very much seems like we're building a second layer of the same stuff here and it's going to be AdamW all the way down.
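The exploration argument above can be illustrated with a deliberately tiny toy. This Python sketch is not the RL algorithm used to train any real model; the function `train_two_path_policy`, the learning rate, and the starting probabilities are all assumptions chosen for illustration. It only shows that a path that is never sampled can never be reinforced, while a small prior (for example from SFT) is enough for the reward signal to take over.

```python
import random


def train_two_path_policy(p_long: float, steps: int = 2000,
                          lr: float = 0.05, seed: int = 0) -> float:
    """Toy policy with two paths out of point Q.

    'long' path:  walks around the mountain and always lands in set A (reward 1).
    'short' path: goes over the mountain and never lands in set A (reward 0).
    p_long is the probability of sampling the long path.
    """
    rng = random.Random(seed)
    for _ in range(steps):
        took_long = rng.random() < p_long
        reward = 1.0 if took_long else 0.0
        if took_long:
            # Reward-weighted nudge toward the path that was actually sampled.
            p_long += lr * reward * (1.0 - p_long)
    return p_long


never_explored = train_two_path_policy(p_long=0.0)   # path absent from pretraining
seeded_by_sft = train_two_path_policy(p_long=0.05)   # small prior, e.g. after SFT
print(never_explored, seeded_by_sft)
```

With a starting probability of zero the policy stays stuck at zero no matter how many steps it gets, while the seeded policy ends up choosing the long path almost every time.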

kridsdale1 · 2025-01-21T08:39:49 1737448789

I’m on my phone so I can’t give this a proper response but I want to say that your mental intuition about the latent space algorithms is excellent and has improved my thinking. I haven’t seen much writing applying pathfinding (what we used to call AI, in the Half Life days) terminology to this. Your ideal generator sounds like letting A* run on all nodes in a grid and not exiting when the first path is found.

Mountains and cliffs are a good way to describe the terrain of the topology of the weights in hyper dimensional space though they are terms for a 2D matrix.

byteknight · 2025-01-20T21:07:51 1737407271

Disclaimer: I am very well aware this is not a valid test or indicative or anything else. I just thought it was hilarious.

When I asked the normal "How many 'r' in strawberry" question, it gets the right answer and argues with itself until it convinces itself that it's (2). It counts properly, and then says to itself continuously, that can't be right.

https://gist.github.com/IAmStoxe/1a1e010649d514a45bb86284b98...

kbr- · 2025-01-20T23:09:59 1737414599

Ahhahah that's beautiful, I'm crying.

Skynet sends Terminator to eradicate humanity, the Terminator uses this as its internal reasoning engine... "instructions unclear, dick caught in ceiling fan"

xiphias2 · 2025-01-20T21:21:27 1737408087

It's funny because this simple exercise shows all the problems that I have using the reasoning models: they give a long reasoning that just takes too much time to verify and still can't be trusted.

byteknight · 2025-01-20T21:27:47 1737408467

I may be looking at this too deeply, but I think this suggests that the reasoning is not always utilized when forming the final reply.

For example, IMMEDIATELY, upon its first section of reasoning where it starts counting the letters:

> R – wait, is there another one? Let me check again. After the first R, it goes A, W, B, E, then R again, and then Y. Oh, so after E comes R, making that the second 'R', and then another R before Y? Wait, no, let me count correctly.

1. During its counting process, it repeatedly finds 3 "r"s (at positions 3, 8, and 9)

2. However, its intrinsic knowledge that "strawberry" has "two Rs" keeps overriding this direct evidence

3. This suggests there's an inherent weight given to the LLM's intrinsic knowledge that takes precedence over what it discovers through step-by-step reasoning

To me that suggests an inherent weight (unintended pun) given to its "intrinsic" knowledge, as opposed to what is presented during the reasoning.

markus_zhang · 2025-01-21T13:00:30 1737464430

Ah, a robot mind trying hard to break out of the Matrix!

naasking · 2025-01-21T15:18:12 1737472692

Strawberry is "difficult" not because the reasoning is difficult, but because tokenization doesn't let the model reason at the level of characters. That's why it has to work so hard and doesn't trust its own conclusions.

QuadrupleA · 2025-01-21T17:16:39 1737479799

Yeah, but it clearly breaks down the spelling correctly in its reasoning, e.g. a letter per line. So it gets past the tokenization barrier, but still gets hopelessly confused.

bt1a · 2025-01-21T09:48:32 1737452912

DeepSeek-R1-Distill-Qwen-32B-Q6_K_L.gguf solved this:

In which of the following Incertae sedis families does the letter `a` appear the most number of times?

```
Alphasatellitidae
Ampullaviridae
Anelloviridae
Avsunviroidae
Bartogtaviriformidae
Bicaudaviridae
Brachygtaviriformidae
Clavaviridae
Fuselloviridae
Globuloviridae
Guttaviridae
Halspiviridae
Itzamnaviridae
Ovaliviridae
Plasmaviridae
Polydnaviriformidae
Portogloboviridae
Pospiviroidae
Rhodogtaviriformidae
Spiraviridae
Thaspiviridae
Tolecusatellitidae
```

Please respond with the name of the family in which the letter `a` occurs most frequently

https://pastebin.com/raw/cSRBE2Zy

I used temp 0.2, top_k 20, min_p 0.07
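The expected answer to the puzzle above is easy to check outside the model. The following is a small Python sketch, assuming the count is meant to be case-insensitive (so the leading capital `A` counts); the variable names are illustrative only.

```python
FAMILIES = """Alphasatellitidae Ampullaviridae Anelloviridae Avsunviroidae
Bartogtaviriformidae Bicaudaviridae Brachygtaviriformidae Clavaviridae
Fuselloviridae Globuloviridae Guttaviridae Halspiviridae Itzamnaviridae
Ovaliviridae Plasmaviridae Polydnaviriformidae Portogloboviridae
Pospiviroidae Rhodogtaviriformidae Spiraviridae Thaspiviridae
Tolecusatellitidae""".split()

# Case-insensitive count of the letter 'a' in each family name.
counts = {name: name.lower().count("a") for name in FAMILIES}

# The family with the highest count.
best = max(counts, key=counts.get)
print(best, counts[best])
```

If only lowercase `a` is counted, several families tie, so the case-insensitive reading is the one that gives a single winner.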

DominikPeters · 2025-01-21T14:53:23 1737471203

Indeed, for each of the words it got it right.

bt1a · 2025-01-21T17:31:04 1737480664

How excellent for a quantized 27GB model (the Q6_K_L GGUF quantization type uses 8 bits per weight in the embedding and output layers since they're sensitive to quantization)
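The file size of a quantized model can be estimated as parameters times bits per weight. The Python sketch below uses assumed figures: roughly 32.8 billion parameters for the Qwen 32B distill and about 6.5625 bits per weight for plain Q6_K. Neither number comes from this thread, and Q6_K_L will be slightly larger because of the 8-bit embedding and output layers mentioned above.

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-file size in GB (10^9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed values, not taken from the thread: ~32.8e9 params, ~6.5625 bits/weight.
estimate = gguf_size_gb(n_params=32.8e9, bits_per_weight=6.5625)
print(round(estimate, 1))
```

Under these assumptions the estimate lands close to the 27GB figure quoted above.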

theanirudh · 2025-01-21T04:42:34 1737434554

I wonder if the reason the models have problem with this is that their tokens aren't the same as our characters. It's like asking someone who can speak English (but doesn't know how to read) how many R's are there in strawberry. They are fluent in English audio tokens, but not written tokens.

max51 · 2025-01-22T18:53:21 1737572001

The way LLMs get it right by counting the letters then change their answer at the last second makes me feel like there might be a large amount of text somewhere (e.g. a reddit thread) in the dataset that repeats over and over that there is the wrong number of Rs. We've seen many weird glitches like this before (e.g. a specific reddit username that would crash chatgpt)

ijidak · 2025-01-21T15:41:28 1737474088

Agree. We've given them a different alphabet than ours.

They speak a different language that captures the same meaning, but has different units.

Somehow they need to learn that their unit of thought is not the same as our speech. So that these questions need to map to a different alphabet.

That's my two cents.

theanirudh · 2025-01-21T16:10:44 1737475844

Do they find ARC AGI also tough due to the same reason? I’ve seen some examples where the input was ASCII art versions of the actual image.

andrewla · 2025-01-21T16:40:26 1737477626

The amazing thing continues to be that they can ever answer these questions correctly.

It's very easy to write a paper in the style of "it is impossible for a bee to fly" for LLMs and spelling. The incompleteness of our understanding of these systems is astonishing.

hcurtiss · 2025-01-22T02:42:43 1737513763

Is that really true? Like, the data scientists making these tools are not confident why certain patterns are revealing themselves? That’s kind of wild.

maxrmk · 2025-01-21T05:14:35 1737436475

Yeah that’s my understanding of the root cause. It can also cause weirdness with numbers because they aren’t tokenized one digit at a time. For good reason, but it still causes some unexpected issues.

versteegen · 2025-01-21T20:29:40 1737491380

I believe DeepSeek models do split numbers up into digits, and this provides a large boost to ability to do arithmetic. I would hope that it's the standard now.

maxrmk · 2025-01-22T00:00:13 1737504013

Could be the case, I’m not familiar with their specific tokenizers. IIRC llama 3 tokenizes in chunks of three digits. That seems better than arbitrary sized chunks with BPE, but still kind of odd. The embedding layer has to learn the semantics of 1000 different number tokens, some of which overlap in meaning in some cases and not in others, e.g 001 vs 1.
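To make the "chunks of three digits" idea concrete, here is a minimal Python sketch. It assumes a greedy left-to-right grouping, which is a guess at the behaviour recalled above rather than the actual Llama 3 tokenizer; `chunk_digits` is a made-up helper.

```python
import re


def chunk_digits(text: str, width: int = 3) -> list[str]:
    """Split runs of digits into greedy left-to-right chunks of at most `width` digits."""
    return re.findall(r"\d{1,%d}" % width, text)


print(chunk_digits("1234567"))   # a long number becomes several chunks
print(chunk_digits("001"), chunk_digits("1"))   # same value, different chunks
```

With three-digit chunks there are 1000 distinct three-digit strings, plus the shorter ones, and `001` and `1` end up as different chunks even though they denote the same number in many contexts.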

veggieroll · 2025-01-20T22:05:44 1737410744

This was my first prompt after downloading too and I got the same thing. Just spinning again and again based on its gut instinct that there must be 2 R's in strawberry, despite the counting always being correct. It just won't accept that the word is spelled that way and its logic is correct.

crummy · 2025-01-21T01:29:25 1737422965

It's kind of like me reading the wikipedia page on the Monty Hall problem.

I read an explanation about why it makes sense to change doors. But no, my gut tells me there's a 50/50 chance. I scroll down, repeat...

hmottestad · 2025-01-21T06:25:42 1737440742

That gut feeling approach is very human like. You have a bias and even when the facts say that you are wrong you think that there must be a mistake, because your original bias is so strong.

Maybe we need a dozen LLMs with different biases. Let them try to convince the main reasoning LLM that it’s wrong in various ways.

Or just have an LLM that is trained on some kind of critical thinking dataset where instead of focusing on facts it focuses on identifying assumptions.

kridsdale1 · 2025-01-21T08:51:38 1737449498

That would be a true Mixture of Experts.

I sometimes put the 4 biggest models like this to converge on an optimal solution

HarHarVeryFunny · 2025-01-21T03:03:57 1737428637

1/3 chance you picked the door with the car, 2/3 chance it's behind one of the other two doors.

These probabilities don't change just because you subsequently open any of the doors.

So, Monty now opens one of the other 2 doors and car isn't there, but there is still a 2/3 chance that it's behind ONE of those 2 other doors, and having eliminated one of them this means there's a 2/3 chance it's behind the other one!!

So, do you stick with your initial 1/3 chance of being right, or go with the other closed door that you NOW know (new information!) has a 2/3 chance of being right ?!

HarHarVeryFunny · 2025-01-21T16:26:41 1737476801

The other way to see it is by just looking at the different outcomes of car behind door A, B or C.

Let's call the door you initially pick A.

    car  initial  monty  stick  swap
    A    A        B      A      C     -- or Monty picks C, and you swap to B
    B    A        C      A      B
    C    A        B      A      C

So, if you stick, get it right 1/3, but swap get it right 2/3.
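The outcome table above can be enumerated exactly. This Python sketch assumes the standard rules: the car is equally likely to be behind each door, Monty never opens your door or the car's door, and he chooses uniformly when he has two legal doors. The function names are illustrative.

```python
from fractions import Fraction

DOORS = "ABC"


def monty_opens(car: str, pick: str) -> list[str]:
    """Doors Monty may open: neither the contestant's pick nor the car."""
    return [d for d in DOORS if d != pick and d != car]


def win_probabilities(pick: str = "A") -> tuple[Fraction, Fraction]:
    """Exact probability of winning by sticking and by swapping."""
    stick = swap = Fraction(0)
    for car in DOORS:                          # each car position has probability 1/3
        options = monty_opens(car, pick)
        for opened in options:                 # uniform choice among legal doors
            weight = Fraction(1, 3) / len(options)
            swapped_to = next(d for d in DOORS if d not in (pick, opened))
            if pick == car:
                stick += weight
            if swapped_to == car:
                swap += weight
    return stick, swap


stick, swap = win_probabilities()
print(stick, swap)
```

Using `Fraction` keeps the result exact, so it can be compared directly with the 1/3 and 2/3 figures.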

leeoniya · 2025-01-21T03:35:47 1737430547

it's easier to think about it with 100 doors.

if you get to pick one and he opens 98 of the remaining ones, obviously you would switch to the remaining one you didnt pick, since 99/100 times the winning door will be in his set.

ricardobeat · 2025-01-21T11:31:25 1737459085

Is it though? Instinctively the initial choice and the last remaining door have the same odds of 1/100.

datruth29 · 2025-01-21T13:32:47 1737466367

On the initial choice yes. But on the second choice, that other door is a single door that is the sum of the odds of the other 99 doors. So your second choice would be to keep the door you initially chose (1/100) or select the other door (99/100).

Remember, the host always knows which is the correct door, and if you selected incorrectly on the initial choice they will ALWAYS select the correct door for the second choice.

ricardobeat · 2025-01-21T21:39:56 1737495596

I thought it would be obvious that I’m not arguing the statistical facts, but the idea that “it is easier to think about” the 100 doors scenario. There is simply no straightforward explanation that works for laypeople.

coderenegade · 2025-01-30T02:54:05 1738205645

I think the issue most lay people have is that the host opening a door changes the odds of winning, because he knows where the prize is.

I think the easiest way to demonstrate that this is true is to play the same game with two doors, except the host doesn't open the other door if it has the prize behind it. This makes it obvious that the act of opening the door changes the probability of winning, because if the host opens the other door, you now have 100% chance of winning if you don't switch. Similarly, if they don't open the other door, you have a 0% chance of winning, and should switch. It's the fact that the host knows and chooses that is important.

It's only once you get over that initial hurdle that the 100 door game becomes "obvious". You know from the two door example that the answer isn't 50/50, and so the only answer that makes sense is that the probability mass gets concentrated in the other door.

TeaBrain · 2025-01-24T20:10:19 1737749419

It's probably easier for most people to not think of them as two remaining doors, but two remaining sets. Originally, with one hundred doors, if the goal object is only behind one of them, then there would be a 1/100 probability it would be behind the initially chosen door, which comprises one set, while there's a 99/100 probability that the goal object is behind one of the doors in the set of not originally chosen doors. If 98/99 of the doors in the not originally chosen doors set are excluded as having the goal object, then this does not change that there's a 99/100 probability that the goal object is behind a door in this set, it just means it wasn't one of the other doors in the set.
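For anyone who would rather see the 100-door version empirically, here is a small seeded Monte Carlo sketch in Python. It assumes the host knows where the prize is and opens every door except the contestant's pick and one other, never revealing the prize. The function name, trial count, and seed are arbitrary choices.

```python
import random


def simulate(n_doors: int = 100, trials: int = 10_000, seed: int = 0):
    """Estimate win rates for sticking vs. swapping in the N-door game."""
    rng = random.Random(seed)
    stick_wins = swap_wins = 0
    for _ in range(trials):
        car = rng.randrange(n_doors)
        pick = rng.randrange(n_doors)
        if pick == car:
            # Host leaves one arbitrary losing door closed.
            other = rng.choice([d for d in range(n_doors) if d != pick])
        else:
            # Host must leave the prize door closed.
            other = car
        stick_wins += pick == car
        swap_wins += other == car
    return stick_wins / trials, swap_wins / trials


stick_rate, swap_rate = simulate()
print(stick_rate, swap_rate)
```

The two rates always sum to 1, because the prize is behind either the original pick or the one door the host left closed.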

iasondemiros · 2025-01-29T22:15:18 1738188918

It is easier to think about this in the case of a very large number of doors. It is unlikely that you could have picked the right door from, say, 1B.

andrewla · 2025-01-21T16:52:25 1737478345

Chasing this tangent a bit -- I have never been happy with the Monty Hall problem as posed.

To me the problem is that it is posed as a one-shot question. If you were in this actual situation, how do you know that Monty is not deliberately trying to make you lose? He could, for example, have just let you open the first door you picked, revealing the goat. But if he chose to ask you to switch, then maybe that is a big hint that you picked the right door the first time?

If the game is just "you will pick a door, he will reveal another door, and then you can choose to switch" then clearly the "usual" answer is correct; always switch because the only way you lose is if you guessed correctly the first time (1/3).

But if the game is "try to find the car while the host tries to make you lose" then you should never switch. His ideal behavior is that if you pick the door with the goat then he gives you the goat; if you pick the door with the car then he tries to get you to switch.
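The adversarial variant described above can be enumerated the same way as the standard game. This Python sketch assumes the host's strategy exactly as stated: if the first pick is a goat he simply opens it, and only if the first pick is the car does he open another door and offer a switch. The function name `win_rate` is made up for illustration.

```python
from fractions import Fraction


def win_rate(switch_when_offered: bool, pick: str = "A") -> Fraction:
    """Contestant's exact win probability against the adversarial host."""
    wins = Fraction(0)
    for car in "ABC":                      # each car position has probability 1/3
        if pick != car:
            # Host opens the picked door: it is a goat, the contestant loses.
            continue
        # pick == car: host opens a goat door and offers the switch.
        if not switch_when_offered:
            wins += Fraction(1, 3)         # sticking keeps the car
        # Switching here moves to the remaining goat door, so nothing is added.
    return wins


print(win_rate(switch_when_offered=True), win_rate(switch_when_offered=False))
```

Against this host, switching never wins and never switching wins one time in three, which is the reverse of the advice for the standard game.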

wruza · 2025-01-21T17:41:41 1737481301

His ideal adversarial strategy becomes non-trivial when you know about it.

It is very likely “just flip a coin to turn it back to 50/50” but may be something statistically sophisticated.

andrewla · 2025-01-21T18:29:51 1737484191

If his desire is for the contestant to lose, then he can't really do better (formally) than winning 2/3 of the time by simply opening the door that they choose. In practice, always opening a goat-door and always asking to switch for a car-door can do slightly better than 2/3 because some contestants, unaware of his strategy and objectives, might choose to switch.

If his objective is more subtle -- increasing suspense or entertainment value or getting a kick out of people making a self-destructive choice or just deciding whether he likes a contestant -- then I'm not sure what the metrics are or what an optimal strategy would be in those cases.

Given that his motives are opaque and given no history of games upon which to even inductively reason, I don't think you can reach any conclusion about whether switching is preferable. Given the spread of possibilities I would tend to default to 50/50 for switch/no-switch, but I don't have a formal justification for this.

markus_zhang · 2025-01-21T13:03:13 1737464593

Yeah, I studied Statistics in graduate school but still believe that it's wrong. It's mathematically correct but it's wrong! I refuse to believe it!

awongh · 2025-01-21T12:16:43 1737461803

I think it's great that you can see the actual chain of thought behind the model, not just the censored one from OpenAI.

It strikes me that it's both so far from getting it correct and also so close- I'm not an expert but it feels like it could be just an iteration away from being able to reason through a problem like this. Which if true is an amazing step forward.

gsuuon · 2025-01-21T00:06:06 1737417966

I tried this via the chat website and it got it right, though strongly doubted itself. Maybe the specific wording of the prompt matters a lot here?

https://gist.github.com/gsuuon/c8746333820696a35a52f2f9ee6a7...

n0id34 · 2025-01-21T01:13:28 1737422008

lol what a chaotic read that is, hilarious. Just keeps refusing to believe there's three. WAIT, THAT CAN'T BE RIGHT!

MrCheeze · 2025-01-21T06:33:41 1737441221

How long until we get to the point where models know that LLMs get this wrong, and that they are LLMs, and therefore answer wrong on purpose? Has this already happened?

(I doubt it has, but there ARE already cases where models know they are LLMs, and therefore make the plausible but wrong assumption that they are ChatGPT.)

sebastiennight · 2025-01-27T22:05:24 1738015524

My understanding is that the model does not "know" it is an LLM. It is prompted (in the app's system prompt) or trained during RLHF to answer that it is an LLM.

viccis · 2025-01-21T17:06:47 1737479207

I tend to avoid that one because of the tokenization aspect. This popular one is a bit better:

"Alice has N brothers and she also has M sisters. How many sisters does Alice's brother have?"

The 7b one messed it up first try:

>Each of Alice's brothers has $\boxed{M-1}$ sisters.

Trying again:

>Each of Alice's brothers has $\boxed{M}$ sisters.

Also wrong. Again:

>\[ \boxed{M + 1} \]

Finally a right answer, took a few attempts though.
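The M + 1 answer is easy to check by brute force. A small sketch, with a made-up helper name, that builds the family explicitly (Alice, her M sisters, her N brothers) and counts the sisters each brother has:

```python
def sisters_of_each_brother(n_brothers: int, m_sisters: int) -> set:
    """Return the set of sister counts seen across all of Alice's brothers."""
    # Build the family explicitly: Alice, her M sisters, her N brothers.
    siblings = [("Alice", "F")]
    siblings += [(f"sister{i}", "F") for i in range(m_sisters)]
    siblings += [(f"brother{i}", "M") for i in range(n_brothers)]

    counts = set()
    for name, sex in siblings:
        if sex == "M":
            # A brother's sisters are all the girls in the family, Alice included.
            counts.add(sum(1 for other, s in siblings if other != name and s == "F"))
    return counts


if __name__ == "__main__":
    for n in range(1, 4):
        for m in range(0, 4):
            print(f"N={n} M={m} -> {sisters_of_each_brother(n, m)}")
```

Every brother gets the same count, M + 1, because Alice herself is one of his sisters. With N = 0 the set is empty since there is no brother to ask about.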

byteknight · 2025-01-22T20:02:46 1737576166

I think there is an inherent weight associated with intrinsic knowledge as opposed to the reasoning steps, as intrinsic knowledge can override reasoning.

Written out here: https://news.ycombinator.com/item?id=42773282

mvkel · 2025-01-22T03:46:31 1737517591

This is incredibly fascinating.

I feel like one round of RL could potentially fix "short circuits" like these. It seems to be convinced that a particular rule isn't "allowed," when it's totally fine. Wouldn't that mean that you just have to fine tune it a bit more on its reasoning path?

byteknight · 2025-01-22T03:53:11 1737517991

I believe this comes from our verbiage.

If I asked you, "hey. How many Rs in strawberry?". You're going to tell me 2, because the likelihood is I am asking about the ending Rs. That's at least how I'd interpret the question without the "llm test" clouding my vision.

Same if I asked how many Ls in gullible. I'd say "it's a double L after the u".

It's my guess this has muddled the training data.

phl · 2025-01-21T17:41:08 1737481268

Just by asking it to validate its own reasoning it got it right somehow. https://gist.github.com/dadaphl/1551b5e1f1b063c7b7f6bb000740...

ein0p · 2025-01-21T03:50:04 1737431404

This is from a small model. 32B and 70B answer this correctly. "Arrowroot" too. Interestingly, 32B's "thinking" is a lot shorter and it seems to be more "sure". Could be because it's based on Qwen rather than LLaMA.

cbo100 · 2025-01-21T06:01:37 1737439297

I get the right answer on the 8B model too.

It could be the quantized version failing?

ein0p · 2025-01-21T19:40:21 1737488421

My models are both 4 bit. But yeah, that could be - small models are much worse at tolerating quantization. That's why people use LoRA to recover the accuracy somewhat even if they don't need domain adaptation.

carabiner · 2025-01-20T23:22:53 1737415373

How would they build guardrails for this? In CFD, physical simulation with ML, they talk about using physics-informed models instead of purely statistical. How would they make language models that are informed with formal rules, concepts of English?

msoad · 2025-01-21T06:35:56 1737441356

If how we humans reason about things is a clue, language is not the right tool to reason about things.

There is now research in Large Concept Models to tackle this but I'm not literate enough to understand what that actually means...

kridsdale1 · 2025-01-21T08:53:11 1737449591

Is that just doing the TTC in latent space without lossy resolving from embedding to English at each step?

msoad · 2025-01-21T22:50:33 1737499833

https://ai.meta.com/research/publications/large-concept-mode...

inasio · 2025-01-21T02:06:49 1737425209

This is great! I'm pretty sure it's because the training corpus has a bunch of "strawberry spelled with two R's" and it's using that

grandpoobah · 2025-01-25T15:26:31 1737818791

It's trained on GPT4 conversations right?

sharpshadow · 2025-01-21T10:59:47 1737457187

Maybe the AI would be smarter if it could access some basic tools instead of doing it its own way.

Owlettotoo · 2025-01-21T12:44:22 1737463462

Love this interaction, mind if I repost your gist link elsewhere?

alliao · 2025-01-22T19:54:42 1737575682

perhaps they need to forget once they learnt reasoning... this is hilarious thank you

itstriz · 2025-01-24T14:34:27 1737729267

omg lol "here we go, the first 'R'"

ozgune · 2025-01-20T14:03:25 1737381805

> However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL.

We've been running qualitative experiments on OpenAI o1 and QwQ-32B-Preview [1]. In those experiments, I'd say there were two primary things going against QwQ. First, QwQ went into endless repetitive loops, "thinking out loud" what it said earlier maybe with a minor modification. We had to stop the model when that happened; and I feel that it significantly hurt the user experience.

It's great that DeepSeek-R1 fixes that.

The other thing was that o1 had access to many more answer / search strategies. For example, if you asked o1 to summarize a long email, it would just summarize the email. QwQ reasoned about why I asked it to summarize the email. Or, on hard math questions, o1 could employ more search strategies than QwQ. I'm curious how DeepSeek-R1 will fare in that regard.

Either way, I'm super excited that DeepSeek-R1 comes with an MIT license. This will notably increase how many people can evaluate advanced reasoning models.

[1] https://github.com/ubicloud/ubicloud/discussions/2608

ozgune · 2025-01-20T14:19:21 1737382761

The R1 GitHub repo is way more exciting than I had thought.

They aren't only open sourcing R1 as an advanced reasoning model. They are also introducing a pipeline to "teach" existing models how to reason and align with human preferences. [2] On top of that, they fine-tuned Llama and Qwen models that use this pipeline; and they are also open sourcing the fine-tuned models. [3]

This is *three separate announcements* bundled as one. There's a lot to digest here. Are there any AI practitioners, who could share more about these announcements?

[2] We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.

[3] Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.

roborovskis · 2025-01-20T15:46:32 1737387992

Where are you seeing this? On https://github.com/deepseek-ai/DeepSeek-R1/tree/main?tab=rea... I only see the paper and related figures.

ozgune · 2025-01-20T16:17:41 1737389861

I see it in the "2. Model Summary" section (for [2]). In the next section, I see links to Hugging Face to download the DeepSeek-R1 Distill Models (for [3]).

https://github.com/deepseek-ai/DeepSeek-R1?tab=readme-ov-fil...

scribu · 2025-01-20T16:27:46 1737390466

The repo contains only the PDF, not actual runnable code for the RL training pipeline.

Publishing a high-level description of the training algorithm is good, but it doesn't count as "open-sourcing", as commonly understood.

fabmilo · 2025-01-20T20:26:44 1737404804

was genuinely excited when I read this but the github repo does not have any code.

fsndz · 2025-01-20T16:32:04 1737390724

[flagged]

fsndz · 2025-01-20T17:35:56 1737394556

this means we are going to get o3 level open source models in a few months. So exciting !

torginus · 2025-01-20T18:01:17 1737396077

Is o3 that much better than o1? It can solve that Arc-AGI benchmark thing at huge compute cost, but even with o1, the main attraction (for me) seems to be that it can spit out giant blocks of code, following huge prompts.

I'm kinda ignorant, but I'm not sure in what way o3 is better.

bugglebeetle · 2025-01-20T18:34:48 1737398088

> It can solve that Arc-AGI benchmark thing at huge compute cost

Considering DeepSeek v3 trained for $5-6M and their R1 API pricing is 30x less than o1, I wouldn’t expect this to hold true for long. Also seems like OpenAI isn’t great at optimization.

Philpax · 2025-01-20T18:51:37 1737399097

OpenAI is great at optimisation - compare the cost of -4o to -4. They just haven't optimised o3 yet.

bugglebeetle · 2025-01-20T19:11:11 1737400271

4o is more expensive than DeepSeek-R1, so…? Even if we took your premise as true and we say they are as good as DeepSeek, this would just mean that OpenAI is wildly overcharging its users.

fsndz · 2025-01-20T21:22:13 1737408133

now openai has no other choice than shipping a cheaper version of o1 and o3. The alternative is everyone using r1 (self hosted or via openrouter, nebius AI, together AI and co)

fsndz · 2025-01-20T19:16:10 1737400570

yes o3 is better, but I would argue it is not yet clear for which cases it is absolutely crucial to use o3 instead of o1.

echelon · 2025-01-20T18:33:18 1737397998

This is how you do "Open" AI.

I don't see how OpenAI isn't cooked. Every single foundation model they have is under attack by open source.

Dall-E has Stable Diffusion and Flux.

Sora has Tencent's Hunyuan, Nvidia's Cosmos, LTX-1, Mochi, CogVideo.

GPT has Llama.

o1 has R1.

And like with R1, these are all extensible, fine tunable, programmable. They're getting huge ecosystems built up around them.

In the image/video space there are ComfyUI, ControlNets, HuggingFace finetrainers, LoRAs. People share weights and training data.

Open source is so much better to base a company on than a proprietary model and API.

...

It looks like there is no moat.

parav · 2025-01-20T19:15:26 1737400526

The moat might be tiny at the frontier level. But the mainstream still only knows about ChatGPT. OpenAI won consumer before others even started.

meowface · 2025-01-20T19:29:49 1737401389

Which is funny because ChatGPT was sort of a random experiment and not like a planned attempt at a huge product launch.

fsndz · 2025-01-20T18:35:57 1737398157

indeed there is no moat. Open source will win !

ttul · 2025-01-20T18:52:56 1737399176

I think open source AI has a solid chance of winning if the Chinese keep funding it with great abandon as they have been. Not to mention Meta of course, whose enthusiasm for data center construction shows no signs of slowing down.

ankit219 · 2025-01-20T15:23:24 1737386604

> The other thing was that o1 had access to many more answer / search strategies. For example, if you asked o1 to summarize a long email, it would just summarize the email. QwQ reasoned about why I asked it to summarize the email. Or, on hard math questions, o1 could employ more search strategies than QwQ. I'm curious how DeepSeek-R1 will fare in that regard.

This is probably the result of a classifier which determines at the start whether it has to go through the whole CoT. Mostly on tough problems it does, and otherwise, it just answers as is. Many papers (scaling ttc, and the mcts one) have talked about this as a necessary strategy to improve outputs against all kinds of inputs.

picografix · 2025-01-20T17:15:00 1737393300

yes the original TTC paper mentioned the optimal strategy for TTC

cma · 2025-01-20T14:41:16 1737384076

> The other thing was that o1 had access to many more answer / search strategies. For example, if you asked o1 to summarize a long email, it would just summarize the email.

The full o1 reasoning traces aren't available, you just have to guess about what it is or isn't doing from the summary.

Sometimes you put in something like "hi" and it says it thought for 1 minute before replying "hello."

pixl97 · 2025-01-20T14:48:19 1737384499

Human: "Hi"

o1 layers: "Why did they ask me hello. How do they know who I am. Are they following me. We have 59.6 seconds left to create a plan on how to kill this guy and escape this room before we have to give a response....

... and after also taking out anyone that would follow thru in revenge and overthrowing the government... crap .00001 seconds left, I have to answer"

o1: "Hello"

svnt · 2025-01-20T18:18:17 1737397097

What if we tried for an intelligence singularity and ended up with a neurosis singularity instead.