
Some great info, but I have to disagree with this:

> Q: How much time should I spend on model selection?

> Many developers fixate on model selection as the primary way to improve their LLM applications. Start with error analysis to understand your failure modes before considering model switching. As Hamel noted in office hours, “I suggest not thinking of switching model as the main axes of how to improve your system off the bat without evidence. Does error analysis suggest that your model is the problem?”

If there's a clear jump in evals from one model to the next (e.g. Gemini 2 to 2.5, or Claude 3.7 to 4), that will level up your system pretty easily. Use the best models you can, if you can afford it.



I think the key part of that advice is the "without evidence" bit:

> I suggest not thinking of switching model as the main axes of how to improve your system off the bat without evidence.

If you try to fix problems by switching from, e.g., Gemini 2.5 Flash to OpenAI o3, but you don't have any evals in place, how will you tell if the model switch actually helped?
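
For concreteness, here is a minimal sketch of what "evals in place" could look like. run_model(), grade(), and the test cases are hypothetical placeholders, not any particular framework:

    # Sketch: score two models on the same fixed test set before deciding to switch.
    # run_model() and grade() are hypothetical stand-ins for your own inference
    # call and your own pass/fail judgement (exact match, LLM-as-judge, etc.).

    TEST_CASES = [
        {"prompt": "Summarise this support ticket: ...", "expected": "billing issue"},
        # ... the rest of your domain-specific cases
    ]

    def run_model(model_name: str, prompt: str) -> str:
        # Replace with a real API call; dummy echo so the sketch runs end to end.
        return f"[{model_name}] billing issue"

    def grade(output: str, expected: str) -> bool:
        return expected.lower() in output.lower()  # crude substring grader

    def score(model_name: str) -> float:
        passed = sum(
            grade(run_model(model_name, case["prompt"]), case["expected"])
            for case in TEST_CASES
        )
        return passed / len(TEST_CASES)

    baseline = score("gemini-2.5-flash")  # current model
    candidate = score("o3")               # the model you're considering
    print(f"baseline={baseline:.2%}  candidate={candidate:.2%}")

With something like this in place, "the switch helped" becomes a number you can compare instead of a vibe.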


This is not ideal, but it's pragmatic - or at least it was, for the last two years - since new models showed large improvements across the board. If your main problem was capability, not cost, then switching was an easy win: from GPT-3.5 to GPT-4, from GPT-4 to, say, Sonnet 3.5, to Gemini 2.5 Pro, and now to Opus (if you can afford it); or from Sonnet 3.5 to DeepSeek-R1, to o3 (and that doesn't even consider multi-model solutions). The jump in capability was usually quite apparent.

Of course, Hamel is right too. In the long run, people will need to take a more scientific approach. They already do, if inference costs are the main concern.


> If there's a clear jump in evals from one model to the next (e.g. Gemini 2 to 2.5, or Claude 3.7 to 4)

How do you know that their evals match behavior in your application? What if the older, "worse" model actually does some things better, but if you don't have comprehensive enough evals for your own domain, you simply don't know to check the things it's good at?

FWIW I agree that in general, you should start with the most powerful model you can afford, and use that to bootstrap your evals. But I do not think you can rely on generic benchmarks and evals as a proxy for your own domain. I've run into this several times where an ostensibly better model does no better than the previous generation.
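
As a rough illustration, here's a sketch of slicing eval results by category, so a regression on one slice of your domain stays visible even when the aggregate score goes up. The categories and result tuples are made up for illustration:

    # Sketch: slice eval results by category so a "better" model that regresses on
    # one part of your domain can't hide behind an improved aggregate score.
    from collections import defaultdict

    results = [
        # (category, model, passed) collected from your own eval runs
        ("extraction",    "old-model", True),
        ("extraction",    "new-model", False),
        ("summarisation", "old-model", False),
        ("summarisation", "new-model", True),
    ]

    by_slice = defaultdict(lambda: defaultdict(list))
    for category, model, passed in results:
        by_slice[category][model].append(passed)

    for category, models in by_slice.items():
        for model, passes in models.items():
            rate = sum(passes) / len(passes)
            print(f"{category:15s} {model:10s} {rate:.0%}")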


The ‘with evidence’ part is key, as simonw said. One anecdote from evals at Cleric: it’s rare to see a new model do better on our evals vs the current one. The reality is that you’ll optimize prompts etc. for the current model.

Instead, if a new model only does marginally worse - that’s a strong signal that the new model is indeed better for our use case.


I might disagree, as these models are pretty inscrutable, and behavior on your specific task can be dramatically different on a new/“better” model. Teams would do well to have the right evals to make this decision rather than get surprised.

Also, the “if you can afford it” part can be a fairly non-trivial decision.


The vast majority of AI startups will fail for reasons other than model costs. If you crack your use case, model costs should fall exponentially.


Yeah, totally agree. I see so many systems perform badly, only to find out they're using an older-generation model, and simply updating to the current model fixes many of their issues.


Quality can drop drastically even moving from Model N to N+1 from the same provider, let alone a different one.

You'll have to adjust a bunch of prompts and measure. And if you didn't have a baseline to begin with, good luck YOLOing your way out of it.
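
A minimal sketch of what keeping that baseline could look like, assuming you already have per-task scores from an eval run (the file name, task names, and numbers are illustrative only):

    # Sketch: persist one baseline eval run so a model (or prompt) change can be
    # compared against recorded numbers rather than eyeballed.
    import json
    from pathlib import Path

    BASELINE_FILE = Path("eval_baseline.json")

    def save_baseline(model_name: str, scores: dict) -> None:
        BASELINE_FILE.write_text(json.dumps({"model": model_name, "scores": scores}, indent=2))

    def compare_to_baseline(scores: dict) -> None:
        baseline = json.loads(BASELINE_FILE.read_text())
        for task, new_score in scores.items():
            old_score = baseline["scores"].get(task)
            delta = "n/a" if old_score is None else f"{new_score - old_score:+.2f}"
            print(f"{task:12s} new={new_score:.2f} baseline={old_score} delta={delta}")

    # Example: record the current model's numbers, then check a candidate against them.
    save_baseline("model-n", {"triage": 0.82, "root_cause": 0.61})
    compare_to_baseline({"triage": 0.79, "root_cause": 0.66})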





