At a high level, you asked the LLM to translate N lines of code into maybe 2N lines of code, while the GP asked the LLM to translate N lines of English into possibly 10N lines of code. Very different scenarios.
The OP said the LLM didn't build anything, declared it was great, and didn't even compile it. My experience has been quite the opposite: not only compiling the code and fixing compile-time errors, but also running it and fixing runtime issues. It even went so far as to write waveform-analysis tools in Python (the output of this project was WAV files) to track down the issues.
It doesn't really matter what we told it to do; a task is a task. But clearly how each LLM performed that task was very different for me than for the OP.
I'll be the first to say I've abandoned a chat and started a new one to get the result I want. I don't see that as a net negative though -- that's just how you use it.
After the latest production issue, I have a feeling that opus-4.5 and gpt-5.1-codex-max are perhaps better than me at debugging. Indeed my role was relegated to combing through the logs, finding the abnormal / suspicious ones, and feeding those to the models.
It is the reasoning. During the reasoning process, the top few tokens have very similar or even identical logprobs. With gpt-oss-120b, you should be able to get deterministic output by turning off reasoning, e.g. by appending an empty <think></think> block to the prompt:
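Something along these lines, as a rough sketch only (the endpoint URL, model name, example prompt, and exact chat-template handling here are my assumptions; gpt-oss's real template may differ):

import requests

# Hypothetical OpenAI-compatible completions endpoint; adjust URL/model for your setup.
URL = "http://localhost:8000/v1/completions"

# Pre-fill an empty reasoning block so the model skips straight to the answer.
prompt = (
    "User: Summarize this filing in one sentence.\n"
    "Assistant: <think>\n</think>\n"
)

resp = requests.post(URL, json={
    "model": "gpt-oss-120b",
    "prompt": prompt,
    "temperature": 0,   # greedy decoding; with no reasoning tokens, the near-tied logprobs matter much less
    "max_tokens": 256,
})
print(resp.json()["choices"][0]["text"])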
Good call: reasoning-token variance is likely a factor, especially with logprob clustering at T=0. Your <think></think> workaround would work, but we need reasoning intact for financial QA accuracy.
Also, the Mistral Medium model we tested had ~70% deterministic outputs across the 16 runs for the text-to-SQL generation and JSON summarization tasks, and it had reasoning on. Llama 3.3 70B started to degrade and doesn't have reasoning. But it's a relevant variable to consider.
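Roughly the kind of harness behind that check (a minimal sketch; the endpoint, model name, and prompt here are placeholders, not our actual setup): send the identical prompt 16 times at T=0 and count byte-identical outputs.

from collections import Counter
from openai import OpenAI

# Placeholder endpoint/model; point this at whatever OpenAI-compatible server hosts the model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

PROMPT = "Convert to SQL: total revenue by region for 2024."
N_RUNS = 16

outputs = []
for _ in range(N_RUNS):
    resp = client.chat.completions.create(
        model="mistral-medium",
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,
    )
    outputs.append(resp.choices[0].message.content)

# Fraction of runs matching the most common output = how deterministic the model is on this task.
most_common_count = Counter(outputs).most_common(1)[0][1]
print(f"{most_common_count}/{N_RUNS} identical runs")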
Somehow that article totally ignored the insane pricing of cached input tokens set by Anthropic and OpenAI. For agentic coding, typically 90~95% of the inference cost is attributed to cached input tokens, and a scrappy Chinese company can do it almost for free: https://api-docs.deepseek.com/news/news0802
> But I did not expect something as core as the service managing the leases on the physical EC2 nodes to not have recovery procedure.
I guess they don't have a recovery procedure for the "congestive collapse" edge case. I have seen something similar, so I wouldn't be frowning at this.
2. Having DynamoDB as a dependency of this DWFM service, instead of something more primitive like Chubby. Need to learn more about distributed systems primitives from https://www.youtube.com/watch?v=QVvFVwyElLY
This looks impressive. As someone who is not familiar with ML, I do have a question -- surely in 2025 there must be a way to schedule a large PyTorch job across multiple k8s clusters? EKS and GKE already provide VPC-native flat networking by default.
The issue isn’t so much scheduling as it is stability.
More clusters means one more layer of things that can crash your (very expensive) training.
You also still need to write tooling to manage cross-cluster training correctly, such as starting/stopping jobs at roughly the same time, resuming from checkpoints, node health monitoring, etc. (roughly the kind of glue sketched below).
Nothing deal-breaking, but if it could all just work in a single cluster, that would be nicer.
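For instance, the checkpoint-resume part of that glue is conceptually simple but still has to be written and tested per setup; a minimal sketch, assuming a shared checkpoint directory and made-up file naming:

import glob
import os
import torch

def latest_checkpoint(ckpt_dir: str):
    # Newest "step_*.pt" file in the shared checkpoint directory, or None if nothing was saved yet.
    paths = glob.glob(os.path.join(ckpt_dir, "step_*.pt"))
    return max(paths, key=os.path.getmtime) if paths else None

def resume_or_start(model, optimizer, ckpt_dir: str) -> int:
    # Load the latest checkpoint if one exists and return the training step to resume from.
    path = latest_checkpoint(ckpt_dir)
    if path is None:
        return 0
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]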
So the GLM-4.5 series omits the embedding layer and the output layer when counting both the total parameters and the active parameters:
> When counting parameters, for GLM-4.5 and GLM-4.5-Air, we include the parameters of MTP layers but not word embeddings and the output layer.
This matches the calculation I did for GLM-4.5 (355B A32B):
In [14]: 356732107008 - (775946240 * 2) # token_embd / output are 775946240 each. assume omitted
Out[14]: 355180214528
In [15]: 356732107008 - 339738624000 - (775946240 * 2) # parameters that are always active
Out[15]: 15441590528
In [16]: 339738624000 * 8 / 160 # parameters from activated experts
Out[16]: 16986931200.0
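Adding the always-active part and the activated experts gives the active count, which lines up with the "A32B" in the name:

In [17]: 15441590528 + 16986931200 # always-active + activated experts, ~32.4B
Out[17]: 32428521728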
Meanwhile, the GPT-OSS series includes both the embedding layer and the output layer when counting the total parameters, but only includes the output layer when counting the active parameters:
> We refer to the models as “120b” and “20b” for simplicity, though they technically have 116.8B and 20.9B parameters, respectively. Unembedding parameters are counted towards active, but not embeddings.
And the Qwen3 series includes both the embedding layer and the output layer when counting both the total parameters and the active parameters.
Why is there no standard way of counting? Which approach is more accurate?
I'd say it depends. For the total parameter count, you should just count all parameters, since that's what matters for memory requirements.
For activated parameters: all unembedding parameters are used in every decoding step during token generation, but only a single embedding vector per token is read from the embedding matrix (if the lookup is done right). So count accordingly, since that's what matters for memory bandwidth and therefore latency.
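To make the difference concrete, here is the same GLM-4.5 arithmetic from the thread under each of the three conventions; the only assumption is my reading of which matrices each vendor counts where:

# GLM-4.5 figures from the thread above
total_params   = 356_732_107_008   # everything, embedding and unembedding included
embed_params   = 775_946_240       # token embedding matrix
unembed_params = 775_946_240       # output (unembedding) matrix
expert_params  = 339_738_624_000   # all routed-expert parameters
active_experts = expert_params * 8 // 160   # 8 of 160 experts per token
dense_active   = total_params - expert_params - embed_params - unembed_params  # always-active, no (un)embedding

conventions = {
    # name: (total count, active count)
    "GLM-style":     (total_params - embed_params - unembed_params, dense_active + active_experts),
    "GPT-OSS-style": (total_params,                                 dense_active + active_experts + unembed_params),
    "Qwen3-style":   (total_params,                                 dense_active + active_experts + unembed_params + embed_params),
}

for name, (t, a) in conventions.items():
    print(f"{name:14s} total {t/1e9:6.1f}B  active {a/1e9:5.1f}B")

Same model, but the headline numbers range from 355.2B/32.4B to 356.7B/34.0B depending purely on the counting convention.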