Can't express it more clearly than this. Data structures are just one part of the story, not the only spot where the rubber meets the road, IMO. But going back to the top of the thread: for new projects it is indeed steps 2 and 3 that consume most of the time, not step 1.
Coool. I remember when the OG Pebble launched, but I couldn't get one for myself (it wasn't available in my region, and my pocket money didn't allow for it either ;) ). Looking forward to this #bitesNailsFuriously
> Unfortunately if you naively quantize all layers to 1.58bit, you will get infinite repetitions in seed 3407: “Colours with dark Colours with dark Colours with dark Colours with dark Colours with dark” or in seed 3408: “Set up the Pygame's Pygame display with a Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's”.
This is a really interesting insight (although other works cover this as well). I am particularly amused by the process by which the authors of this blog post arrived at these particular seeds. Good work nonetheless!
Would be great to have dynamic quants of the non-R1 V3 version, as for some tasks it is good enough. It would also be very interesting to see the degradation with dynamic quants on small/medium-size MoEs, such as older DeepSeek models, Mixtrals, or IBM's tiny Granite MoE. It would be fun if the Granite 1B MoE still functioned at 1.58bit.
Oh yes, one could apply a repetition penalty, for example - but it's not just repetition that's the issue. I find the model rather forgets what it has already seen, and hence it repeats stuff - it's probably best to backtrack, then delete the last few rows in the KV cache.
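For the repetition-penalty option, here's a minimal numpy sketch of the standard CTRL-style penalty (the function name and penalty value are illustrative, not something from the post):

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.3):
    # CTRL-style penalty: make every already-generated token less likely,
    # regardless of whether its logit is positive or negative.
    logits = logits.copy()
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits

logits = np.array([2.0, -1.0, 0.5])
print(apply_repetition_penalty(logits, generated_ids=[0, 1]))
# -> [1.538..., -1.3, 0.5]: tokens 0 and 1 both become less probable
```

Note this only suppresses repeats after the fact; it doesn't restore whatever context the quantized model "forgot", which is why backtracking the KV cache can work better.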
Another option is to employ min_p = 0.05 to force the model not to generate low-probability tokens - it can help especially when the 1.58bit model generates an "incorrect" token (e.g. `score := 0`) roughly once every 8,000 tokens.
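For concreteness, a small numpy sketch of what min_p filtering does: keep only tokens whose probability is at least min_p times the top token's probability, then renormalise. The probabilities below are made up for illustration:

```python
import numpy as np

def min_p_filter(probs, min_p=0.05):
    # Threshold is relative to the most likely token, so the filter
    # adapts to how peaked the distribution is.
    threshold = min_p * probs.max()
    filtered = np.where(probs >= threshold, probs, 0.0)
    return filtered / filtered.sum()

rng = np.random.default_rng(0)
probs = np.array([0.70, 0.20, 0.06, 0.03, 0.01])
# 0.03 and 0.01 fall below 0.05 * 0.70 = 0.035 and are dropped.
next_token = rng.choice(len(probs), p=min_p_filter(probs))
```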
You likely mean sampler, not decoder. And no, the stronger the quantization, the more the output token probabilities diverge from the non-quantized model. With a sampler you can't recover any meaningful accuracy. If you force the sampler to select tokens that won't repeat, you're just trading repetitive gibberish for non-repetitive gibberish.
> And no, the stronger the quantization, the more the output token probabilities diverge from the non-quantized model. With a sampler you can't recover any meaningful accuracy.
Of course you can't recover any accuracy, but LLMs are in fact prone to this kind of repetition no matter what; it's a known failure mode, which is why samplers aimed at avoiding it have been designed over the past few years.
> If you force the sampler to select tokens that won't repeat, you're just trading repetitive gibberish for non-repetitive gibberish.
But it won't necessarily be gibberish! Even a highly quantized R1 still has much more embedded information than a 14B or even 32B model, so I don't see why it should output more gibberish than smaller models.
Maybe I missed something, but this is a roundabout way of doing things where an embedding + ML classifier would have done the job. We don't have to use an LLM just because it can be used, IMO.
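A minimal sketch of the embedding + classifier route being suggested here (the model name, texts, and labels are placeholders, not from the original post):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data just to show the shape of the pipeline.
texts = ["refund my order", "love the new update", "app keeps crashing"]
labels = ["billing", "praise", "bug"]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any small embedding model works
X = model.encode(texts)

clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(model.encode(["charged me twice"])))
```

Cheap to train, cheap to run, and no prompt engineering required.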
Nicely summarised. Another important thing that clearly stands out (not to undermine the effort and work that has gone into this) is that more and more we are seeing larger and more complex building blocks emerging (first it was embedding models, then encoder-decoder layers, and now whole models are being duct-taped together into even more powerful pipelines). The AI/DL ecosystem is growing on a nice trajectory.
Though I wonder if, 10 years down the line, folks will even care about the underlying model details (any more than a current-day web developer needs to know about network packets).
PS: Not great examples, but I hope you get the idea ;)
Why not fix the calculator in a way that avoids/mitigates the scenarios where users end up with wrong quotes, and then do an A/B test? This setup seemingly tilts towards some sort of dark pattern, IMO.
Because the results were probably wrong because the inputs were wrong (exaggerated by over-cautious users). There is no automated way to avoid that in a calculator; only a conversation with a real person (sales, tech support) will reveal the bad inputs.
I wonder if some of that could have been automated. Have a field to indicate whether you are an individual, small business, or large business, and then at least flag fields that seem unusually high (or low; you don't want to provide too-rosy estimates) for that part of the market.
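Something like this sketch, say; the segment names, field names, and ranges are all hypothetical (real ones would come from historical quote data):

```python
# Hypothetical per-segment "typical" ranges for each input field.
TYPICAL_RANGES = {
    "individual":     {"monthly_requests": (100, 10_000)},
    "small_business": {"monthly_requests": (5_000, 500_000)},
    "large_business": {"monthly_requests": (100_000, 50_000_000)},
}

def flag_unusual(segment, inputs):
    # Return a warning for every field outside the typical range
    # for the selected market segment.
    warnings = []
    for field, value in inputs.items():
        lo, hi = TYPICAL_RANGES[segment].get(field, (None, None))
        if lo is not None and not (lo <= value <= hi):
            warnings.append(f"{field}={value} looks unusual for a {segment}")
    return warnings

print(flag_unusual("individual", {"monthly_requests": 250_000}))
```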
Thank you for the nohello.net thing. I am usually pretty awkward when it comes to starting conversations, but I guess I never paid attention to why that was the case. The discussion on this thread clears up the impression I had that it is usually rude to jump directly to the question/task. I got my cue! :)