That phenomenon and others are what made it obvious that CoT is not its "thinking". I think CoT is a process by which the LLM expands its processing boundary, in that it allows it to sample over a larger space of possibilities. So it kind of acts like a "trigger" of sorts that allows the model to explore in more ways than it could without CoT. The first time I saw this was when I witnessed the "wait" phenomenon. Simply inducing the model to say "wait" in its response improved the accuracy of results, as now the model double-checked its "work". Funnily enough, it also sometimes led it to produce a wrong answer where otherwise it should have stuck to its guns. But overall that little "wait" had a net positive effect. That's when I knew CoT was not the same as human thinking, as we don't care about trigger words or anything like that. Our thinking requires zero language (though it does benefit from language); it's a deeper process. That's why I was interested in latent processing models and the forays into that area.
I've had this issue as well since the Codex models were introduced. I tried them, but regular 5.1 on high thinking always worked better for me. I think it's because its thinking is deeper and more nuanced; it seemed to understand better what needed doing. I did have to interact with it more often than with Codex, which just worked for a long time by itself, but those interactions were worth it for the reduction in assumptions and other stuff Codex made. I'm gonna try 5.2 Codex today and hope that changes, but so far I've been happy with base 5.1 high thinking.
Knowing Google's MO, it's most likely not the model but their harness system that's the issue. God, they are so bad at their UI and agentic coding harnesses...
There doesn't need to be a correlation between some data structure and its effects for people to implement some sort of feature. There only need to be enough stupid people in powerful positions who believe in some sort of correlational trend AND for the data gathering task to be cheap enough for them to implement said things. And there's no shortage of that going around. That's why these technologies are dangerous: stupid people with powerful and cheap tools to wield. Kind of like what we saw with the first wave of Facebook algorithms being used against its users to maximize attention to the detriment of everything else.
Yes, exactly. “Well, the AI said to go arrest that guy, and I’ve been hearing for years that AI is super smart, so that must be the right thing to do.”
This seems like one of those devices that looks "meh" at a glance but grows on you once you've used it. In fact, the Bluetooth button feature alone warrants a second look, let alone a mic embedded into the ring with a crazy battery life. If there's a way to hack the device and pipe the mic features to other apps, I think I might get this thing. Edit: never mind, I just noticed 15 hours of recording time with no recharging. Yeah bud, that's a no go.
Determinism vs. nondeterminism is not and has never been an issue. Also, all LLMs are 100% deterministic; what is non-deterministic are the sampling parameters used by the inference engine, which by the way can easily be made 100% deterministic by simply turning off things like batching. This is a matter for cloud-based API providers, as you as the end user don't have access to the inference engine. If you run any of your models locally in llama.cpp, turning off some server startup flags will get you deterministic results. Cloud-based API providers have no choice but to keep batching on, as they are serving millions of users and wasting precious VRAM slots on a single user is wasteful and stupid. See my code and video as evidence if you want to run any local LLM 100% deterministically: https://youtu.be/EyE5BrUut2o?t=1
That's not an interesting difference, from my point of view. The black box we all use is non-deterministic, period. It doesn't matter where on the inside the system stops being deterministic: if I hit the black box twice, I get two different replies. And that doesn't even matter, which you also said.
The more important property is that, unlike compilers, type checkers, linters, verifiers and tests, the output is unreliable. It comes with no guarantees.
One could be pedantic and argue that bugs affect all of the above. Or that cosmic rays make everything unreliable. Or that people are non deterministic. All true, but the rate of failure, measured in orders of magnitude, is vastly different.
My man, did you even check my video? Did you even try the app? This is not "bug related"; nowhere did I say it was a bug. Batch processing is a FEATURE that is intentionally turned on in the inference engine by large-scale providers. That does not mean it has to be on. If they turn off batch processing, all LLM API calls will be 100% deterministic, but it will cost them more money to provide the service, as now you are stuck with providing 1 API call per GPU. "if I hit the black box twice, I get two different replies": what you are saying here is 100% verifiably wrong. Just because someone chose to turn on a feature in the inference engine to save money does not mean LLMs are non-deterministic. LLMs are stateless. Their weights are frozen; you never "run" an LLM, you can only sample it, just like a hologram. And the inference sampling settings you use are what determine the outcome...
Correct me if I'm wrong, but even with batch processing turned off, they are still only deterministic as long as you set the temperature to zero? Which also has the side-effect of decreasing creativity. But maybe there's a way to pass in a seed for the pseudo-random generator and restore determinism in this case as well. Determinism, in the sense of reproducible. But even if so, "determinism" means more than just mechanical reproducibility for most people - including parent, if you read their comment carefully. What they mean is: in some important way predictable for us humans. I.e. no completely WTF surprises, as LLMs are prone to produce once in a while, regardless of batch processing and temperature settings.
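To make the seed idea above concrete, here is a minimal sketch in plain Python. It is a toy, not any real inference engine: `sample_token`, `generate`, and `toy_logits` are hypothetical names invented for illustration, and the logits are held fixed instead of coming from a model. It only shows that sampling at a non-zero temperature can be reproduced when the pseudo-random generator is seeded.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Sample one index from `logits` after temperature scaling (toy softmax)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(logits, temperature, seed, steps=20):
    """Draw `steps` samples using a private generator seeded with `seed`."""
    rng = random.Random(seed)  # fixed seed means the same draw sequence every time
    return [sample_token(logits, temperature, rng) for _ in range(steps)]


# Hypothetical fixed logits standing in for a model's output distribution.
toy_logits = [2.0, 1.0, 0.5, 0.1]

run_a = generate(toy_logits, temperature=0.8, seed=42)
run_b = generate(toy_logits, temperature=0.8, seed=42)
print(run_a == run_b)
```

In this toy, two runs with the same seed and temperature produce the same sequence, so temperature above zero and reproducibility are not mutually exclusive. Whether a given hosted API exposes a seed, or honors it exactly, is a separate question this sketch does not answer.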
You can change ANY sampling parameter once batch processing is off and you will keep the deterministic behavior: temperature, repetition penalty, etc. I've got to say I'm a bit disappointed to see this on Hacker News, as I expect this from Reddit. You bring the whole matter on a silver platter, the video describes in detail how any sampling parameter can be used, I provide the whole code open source so anyone can try it themselves without taking my claims as hearsay... Well, you can lead a horse to water, as they say.
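The thread does not say why batching would affect reproducibility. One commonly cited reason, offered here as an assumption and not as something taken from the video or code mentioned above, is that floating-point addition is not associative, so anything that changes the order in which numbers are summed can change the low bits of a result. A minimal sketch in plain Python, with the hypothetical helper `sum_in_order`:

```python
def sum_in_order(values):
    """Add floats strictly in the order given."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1, 0.2, 0.3]

# Same numbers, different summation order.
left_to_right = sum_in_order(values)
right_to_left = sum_in_order(list(reversed(values)))

print(left_to_right == right_to_left)
print(abs(left_to_right - right_to_left))
```

The two totals differ by a tiny amount. In a real engine, a difference that small could in principle tip a near-tie between two candidate tokens, but how much this matters for any particular engine or batching setup is not something this sketch demonstrates.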
LLMs are not a magic technology; they can only represent data they were trained on. Naturally, most of the data in their training set is NOT conversational. Consider that such data is very limited, and who knows how it was labeled, if at all, during pretraining. With that in mind, LLMs definitely can do all the things you describe, but a very robust and well-tested system prompt has to be used to coax this behavior out. A proper model also has to be used, as some models are simply not trained for this type of interaction.