At a high level, you asked the LLM to translate N lines of code into maybe 2N lines of code, while the GP asked the LLM to translate N lines of English into possibly 10N lines of code. Very different scenarios.
The OP said the LLM didn't build anything, declared it was great, and didn't even compile it. My experience has been quite the opposite: not only compiling the code and fixing compile-time errors, but also running it and fixing runtime issues. It even went so far as to write waveform-analysis tools in Python (the output of this project was WAV files) to track down the issues.
It doesn't really matter what we told it to do; a task is a task. But clearly how each LLM performed that task was very different for me than for the OP.
I'll be the first to say I've abandoned a chat and started a new one to get the result I want. I don't see that as a net negative though -- that's just how you use it.
After the latest production issue, I have a feeling that opus-4.5 and gpt-5.1-codex-max are perhaps better than me at debugging. Indeed my role was relegated to combing through the logs, finding the abnormal / suspicious ones, and feeding those to the models.
It is the reasoning. During the reasoning process, the top few tokens have very similar or even identical logprobs. With gpt-oss-120b, you should be able to get deterministic output by turning off reasoning, e.g. by appending an empty <think></think> block to the prompt:
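Something along these lines, as a rough sketch only (the endpoint URL, model name, example prompt, and exact chat-template handling here are my assumptions; gpt-oss's real template may differ):

import requests

# Hypothetical OpenAI-compatible completions endpoint; adjust URL/model for your setup.
URL = "http://localhost:8000/v1/completions"

# Pre-fill an empty reasoning block so the model skips straight to the answer.
prompt = (
    "User: Summarize this filing in one sentence.\n"
    "Assistant: <think>\n</think>\n"
)

resp = requests.post(URL, json={
    "model": "gpt-oss-120b",
    "prompt": prompt,
    "temperature": 0,   # greedy decoding; with no reasoning tokens, the near-tied logprobs matter much less
    "max_tokens": 256,
})
print(resp.json()["choices"][0]["text"])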
Good call: reasoning-token variance is likely a factor, especially with logprob clustering at T=0. Your <think></think> workaround would work, but we need reasoning intact for financial QA accuracy.
Also, the Mistral Medium model we tested had ~70% deterministic outputs across the 16 runs for the text-to-SQL generation and JSON summarization tasks, and it had reasoning on. Llama 3.3 70B started to degrade and doesn't have reasoning. But it's a relevant variable to consider.
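Roughly the kind of harness behind that check (a minimal sketch; the endpoint, model name, and prompt here are placeholders, not our actual setup): send the identical prompt 16 times at T=0 and count byte-identical outputs.

from collections import Counter
from openai import OpenAI

# Placeholder endpoint/model; point this at whatever OpenAI-compatible server hosts the model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

PROMPT = "Convert to SQL: total revenue by region for 2024."
N_RUNS = 16

outputs = []
for _ in range(N_RUNS):
    resp = client.chat.completions.create(
        model="mistral-medium",
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,
    )
    outputs.append(resp.choices[0].message.content)

# Fraction of runs matching the most common output = how deterministic the model is on this task.
most_common_count = Counter(outputs).most_common(1)[0][1]
print(f"{most_common_count}/{N_RUNS} identical runs")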
Somehow that article totally ignored the insane pricing of cached input tokens set by Anthropic and OpenAI. For agentic coding, typically 90~95% of the inference cost is attributed to cached input tokens, and a scrappy Chinese company can do it almost for free: https://api-docs.deepseek.com/news/news0802
> But I did not expect something as core as the service managing the leases on the physical EC2 nodes to not have recovery procedure.
I guess they don't have a recovery procedure for the "congestive collapse" edge case. I have seen something similar, so I wouldn't be frowning at this.
2. Having DynamoDB as a dependency of this DWFM service, instead of something more primitive like Chubby. Need to learn more about distributed systems primitives from https://www.youtube.com/watch?v=QVvFVwyElLY
This looks impressive. As someone who is not familiar with ML, I do have a question -- surely in 2025 there must be a way to schedule a large PyTorch job across multiple k8s clusters? EKS and GKE already provide VPC-native flat networking by default.
The issue isn’t so much scheduling as it is stability.
More clusters means one more layer of things that can crash your (very expensive) training.
You also still need to write tooling to manage cross-cluster training correctly, such as starting/stopping jobs at roughly the same time, resuming from checkpoints, node health monitoring, etc. (roughly the kind of glue sketched below).
Nothing deal-breaking, but if it could all just work in a single cluster, that would be nicer.
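For instance, the checkpoint-resume part of that glue is conceptually simple but still has to be written and tested per setup; a minimal sketch, assuming a shared checkpoint directory and made-up file naming:

import glob
import os
import torch

def latest_checkpoint(ckpt_dir: str):
    # Newest "step_*.pt" file in the shared checkpoint directory, or None if nothing was saved yet.
    paths = glob.glob(os.path.join(ckpt_dir, "step_*.pt"))
    return max(paths, key=os.path.getmtime) if paths else None

def resume_or_start(model, optimizer, ckpt_dir: str) -> int:
    # Load the latest checkpoint if one exists and return the training step to resume from.
    path = latest_checkpoint(ckpt_dir)
    if path is None:
        return 0
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]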
So the GLM-4.5 series omits the embedding layer and the output layer when counting both the total parameters and the active parameters:
> When counting parameters, for GLM-4.5 and GLM-4.5-Air, we include the parameters of MTP layers but not word embeddings and the output layer.
This matches the calculation I did for GLM-4.5 (355B A32B):
In [14]: 356732107008 - (775946240 * 2) # token_embd / output are 775946240 each. assume omitted
Out[14]: 355180214528
In [15]: 356732107008 - 339738624000 - (775946240 * 2) # parameters that are always active
Out[15]: 15441590528
In [16]: 339738624000 * 8 / 160 # parameters from activated experts
Out[16]: 16986931200.0
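Adding the always-active part and the activated experts gives the active count, which lines up with the "A32B" in the name:

In [17]: 15441590528 + 16986931200 # always-active + activated experts, ~32.4B
Out[17]: 32428521728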
Meanwhile, the GPT-OSS series includes both the embedding layer and the output layer when counting the total parameters, but only includes the output layer when counting the active parameters:
> We refer to the models as “120b” and “20b” for simplicity, though they technically have 116.8B and 20.9B parameters, respectively. Unembedding parameters are counted towards active, but not embeddings.
And the Qwen3 series includes both the embedding layer and the output layer when counting both the total parameters and the active parameters.
Why is there no standard way of counting? Which approach is more accurate?
I'd say it depends. For the total parameter count, you should just count all parameters, since that's what matters for memory requirements.
For activated parameters: all unembedding parameters are used in every decoding step during token generation, but only a single embedding vector per token is read from the embedding matrix (if the lookup is done right). So count accordingly, since that's what matters for memory bandwidth and therefore latency.
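To make the difference concrete, here is the same GLM-4.5 arithmetic from the thread under each of the three conventions; the only assumption is my reading of which matrices each vendor counts where:

# GLM-4.5 figures from the thread above
total_params   = 356_732_107_008   # everything, embedding and unembedding included
embed_params   = 775_946_240       # token embedding matrix
unembed_params = 775_946_240       # output (unembedding) matrix
expert_params  = 339_738_624_000   # all routed-expert parameters
active_experts = expert_params * 8 // 160   # 8 of 160 experts per token
dense_active   = total_params - expert_params - embed_params - unembed_params  # always-active, no (un)embedding

conventions = {
    # name: (total count, active count)
    "GLM-style":     (total_params - embed_params - unembed_params, dense_active + active_experts),
    "GPT-OSS-style": (total_params,                                 dense_active + active_experts + unembed_params),
    "Qwen3-style":   (total_params,                                 dense_active + active_experts + unembed_params + embed_params),
}

for name, (t, a) in conventions.items():
    print(f"{name:14s} total {t/1e9:6.1f}B  active {a/1e9:5.1f}B")

Same model, but the headline numbers range from 355.2B/32.4B to 356.7B/34.0B depending purely on the counting convention.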