Can't express it more clearly than this. Data structures are just one part of the story, not the only spot where the rubber meets the road, IMO. But going back to the top of the thread: for new projects it is indeed steps 2 and 3 that consume most of the time, not step 1.
Coool. I remember when the OG Pebble launched, but I couldn't get one for myself (it wasn't available in my region, and my pocket money didn't allow for it either ;) ). Looking forward to this #bitesNailsFuriously
> Unfortunately if you naively quantize all layers to 1.58bit, you will get infinite repetitions in seed 3407: “Colours with dark Colours with dark Colours with dark Colours with dark Colours with dark” or in seed 3408: “Set up the Pygame's Pygame display with a Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's”.
This is a really interesting insight (although other works cover this as well). I am particularly amused by the process by which the authors of this blog post arrived at these particular seeds. Good work nonetheless!
Would be great to have dynamic quants of the non-R1 V3 version, as for some tasks it is good enough. It would also be very interesting to see the degradation with dynamic quants on small/medium-size MoEs, such as older DeepSeek models, Mixtrals, or IBM's tiny Granite MoE. It would be fun if the Granite 1B MoE still functioned at 1.58bit.
Oh yes, one could apply a repetition penalty, for example - but it's not just repetition that's the issue. I find the model rather forgets what it has already seen, and hence it repeats stuff - it's probably best to backtrack, then delete the last few rows in the KV cache.
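For the repetition-penalty option, here's a minimal numpy sketch of the standard CTRL-style penalty (the function name and penalty value are illustrative, not something from the post):

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.3):
    # CTRL-style penalty: make every already-generated token less likely,
    # regardless of whether its logit is positive or negative.
    logits = logits.copy()
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits

logits = np.array([2.0, -1.0, 0.5])
print(apply_repetition_penalty(logits, generated_ids=[0, 1]))
# -> [1.538..., -1.3, 0.5]: tokens 0 and 1 both become less probable
```

Note this only suppresses repeats after the fact; it doesn't restore whatever context the quantized model "forgot", which is why backtracking the KV cache can work better.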
Another option is to employ min_p = 0.05 to force the model not to generate low-probability tokens - it can help especially when the 1.58bit model generates an "incorrect" token (e.g. `score := 0`) roughly once every 8,000 tokens.
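For concreteness, a small numpy sketch of what min_p filtering does: keep only tokens whose probability is at least min_p times the top token's probability, then renormalise. The probabilities below are made up for illustration:

```python
import numpy as np

def min_p_filter(probs, min_p=0.05):
    # Threshold is relative to the most likely token, so the filter
    # adapts to how peaked the distribution is.
    threshold = min_p * probs.max()
    filtered = np.where(probs >= threshold, probs, 0.0)
    return filtered / filtered.sum()

rng = np.random.default_rng(0)
probs = np.array([0.70, 0.20, 0.06, 0.03, 0.01])
# 0.03 and 0.01 fall below 0.05 * 0.70 = 0.035 and are dropped.
next_token = rng.choice(len(probs), p=min_p_filter(probs))
```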
You likely mean sampler, not decoder. And no, the stronger the quantization, the more the output token probabilities diverge from the non-quantized model. With a sampler you can't recover any meaningful accuracy. If you force the sampler to select tokens that won't repeat, you're just trading repetitive gibberish for non-repetitive gibberish.
> And no, the stronger the quantization, the more the output token probabilities diverge from the non-quantized model. With a sampler you can't recover any meaningful accuracy.
Of course you can't recover any accuracy, but LLMs are in fact prone to this kind of repetition no matter what; it's a known failure mode, which is why samplers aimed at avoiding it have been designed over the past few years.
> If you force the sampler to select tokens that won't repeat, you're just trading repetitive gibberish for non-repetitive gibberish.
But it won't necessarily be gibberish! Even a highly quantized R1 still has much more embedded information than a 14B or even 32B model, so I don't see why it should output more gibberish than smaller models.
Maybe I missed something, but this is a roundabout way of doing things where an embedding + ML classifier would have done the job. We don't have to use an LLM just because it can be used, IMO.
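A minimal sketch of the embedding + classifier route being suggested here (the model name, texts, and labels are placeholders, not from the original post):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data just to show the shape of the pipeline.
texts = ["refund my order", "love the new update", "app keeps crashing"]
labels = ["billing", "praise", "bug"]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any small embedding model works
X = model.encode(texts)

clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(model.encode(["charged me twice"])))
```

Cheap to train, cheap to run, and no prompt engineering required.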
Nicely summarised. Another important thing that clearly stands out (not to undermine the effort and work that has gone into this) is that more and more we are seeing larger and more complex building blocks emerging (first it was embedding models, then encoder-decoder layers, and now whole models are being duct-taped together into even more powerful pipelines). The AI/DL ecosystem is growing on a nice trajectory.
Though I wonder if, 10 years down the line, folks will even care about the underlying model details (any more than a current-day web developer needs to know about network packets).
PS: Not great examples, but I hope you get the idea ;)
Why not fix the calculator in a way that avoids/mitigates the scenarios where users end up with wrong quotes, and then do an A/B test? This setup seemingly tilts towards some sort of dark pattern, IMO.
Because the results were probably wrong because the inputs were wrong (exaggerated by over-cautious users). There is no automated way to avoid that in a calculator; only a conversation with a real person (sales, tech support) will reveal the bad inputs.
I wonder if some of that could have been automated. Have a field to indicate whether you are an individual, small business, or large business, and then at least flag fields that seem unusually high (or low; you don't want to provide too-rosy estimates) for that part of the market.
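Something like this sketch, say; the segment names, field names, and ranges are all hypothetical (real ones would come from historical quote data):

```python
# Hypothetical per-segment "typical" ranges for each input field.
TYPICAL_RANGES = {
    "individual":     {"monthly_requests": (100, 10_000)},
    "small_business": {"monthly_requests": (5_000, 500_000)},
    "large_business": {"monthly_requests": (100_000, 50_000_000)},
}

def flag_unusual(segment, inputs):
    # Return a warning for every field outside the typical range
    # for the selected market segment.
    warnings = []
    for field, value in inputs.items():
        lo, hi = TYPICAL_RANGES[segment].get(field, (None, None))
        if lo is not None and not (lo <= value <= hi):
            warnings.append(f"{field}={value} looks unusual for a {segment}")
    return warnings

print(flag_unusual("individual", {"monthly_requests": 250_000}))
```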
Thank you for the nohello.net thing. I am usually pretty awkward when it comes to starting conversations, but I guess I never paid attention to why that was the case. The discussion on this thread clears up the impression I had that it is usually rude to jump directly to the question/task. I got my cue! :)