
Some great info, but I have to disagree with this:

> Q: How much time should I spend on model selection?

> Many developers fixate on model selection as the primary way to improve their LLM applications. Start with error analysis to understand your failure modes before considering model switching. As Hamel noted in office hours, “I suggest not thinking of switching model as the main axes of how to improve your system off the bat without evidence. Does error analysis suggest that your model is the problem?”

If there's a clear jump in evals from one model to the next (e.g. Gemini 2 to 2.5, or Claude 3.7 to 4), that will level up your system pretty easily. Use the best models you can, if you can afford it.



I think the key part of that advice is the "without evidence" bit:

> I suggest not thinking of switching model as the main axes of how to improve your system off the bat without evidence.

If you try to fix problems by switching from, e.g., Gemini 2.5 Flash to OpenAI o3, but you don't have any evals in place, how will you tell if the model switch actually helped?
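
For concreteness, here is a minimal sketch of what "evals in place" could look like. run_model(), grade(), and the test cases are hypothetical placeholders, not any particular framework:

    # Sketch: score two models on the same fixed test set before deciding to switch.
    # run_model() and grade() are hypothetical stand-ins for your own inference
    # call and your own pass/fail judgement (exact match, LLM-as-judge, etc.).

    TEST_CASES = [
        {"prompt": "Summarise this support ticket: ...", "expected": "billing issue"},
        # ... the rest of your domain-specific cases
    ]

    def run_model(model_name: str, prompt: str) -> str:
        # Replace with a real API call; dummy echo so the sketch runs end to end.
        return f"[{model_name}] billing issue"

    def grade(output: str, expected: str) -> bool:
        return expected.lower() in output.lower()  # crude substring grader

    def score(model_name: str) -> float:
        passed = sum(
            grade(run_model(model_name, case["prompt"]), case["expected"])
            for case in TEST_CASES
        )
        return passed / len(TEST_CASES)

    baseline = score("gemini-2.5-flash")  # current model
    candidate = score("o3")               # the model you're considering
    print(f"baseline={baseline:.2%}  candidate={candidate:.2%}")

With something like this in place, "the switch helped" becomes a number you can compare instead of a vibe.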


This is not ideal, but it's pragmatic - or at least it was, for the last two years - since new models showed large improvements across the board. If your main problem was capability, not cost, then switching was an easy win: from GPT-3.5 to GPT-4, from GPT-4 to, say, Sonnet 3.5, to Gemini 2.5 Pro, and now to Opus (if you can afford it); or from Sonnet 3.5 to DeepSeek-R1, to o3 (and that doesn't even consider multi-model solutions). The jump in capability was usually quite apparent.

Of course, Hamel is right too. In the long run, people will need to take a more scientific approach. They already do, if inference costs are the main concern.


> If there's a clear jump in evals from one model to the next (e.g. Gemini 2 to 2.5, or Claude 3.7 to 4)

How do you know that their evals match behavior in your application? What if the older, "worse" model actually does some things better, but if you don't have comprehensive enough evals for your own domain, you simply don't know to check the things it's good at?

FWIW I agree that in general, you should start with the most powerful model you can afford, and use that to bootstrap your evals. But I do not think you can rely on generic benchmarks and evals as a proxy for your own domain. I've run into this several times where an ostensibly better model does no better than the previous generation.
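
As a rough illustration, here's a sketch of slicing eval results by category, so a regression on one slice of your domain stays visible even when the aggregate score goes up. The categories and result tuples are made up for illustration:

    # Sketch: slice eval results by category so a "better" model that regresses on
    # one part of your domain can't hide behind an improved aggregate score.
    from collections import defaultdict

    results = [
        # (category, model, passed) collected from your own eval runs
        ("extraction",    "old-model", True),
        ("extraction",    "new-model", False),
        ("summarisation", "old-model", False),
        ("summarisation", "new-model", True),
    ]

    by_slice = defaultdict(lambda: defaultdict(list))
    for category, model, passed in results:
        by_slice[category][model].append(passed)

    for category, models in by_slice.items():
        for model, passes in models.items():
            rate = sum(passes) / len(passes)
            print(f"{category:15s} {model:10s} {rate:.0%}")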


The ‘with evidence’ part is key, as simonw said. One anecdote from evals at Cleric: it’s rare to see a new model do better on our evals vs the current one. The reality is that you’ll optimize prompts etc. for the current model.

Instead, if a new model only does marginally worse - that’s a strong signal that the new model is indeed better for our use case.


I might disagree, as these models are pretty inscrutable, and behavior on your specific task can be dramatically different on a new/“better” model. Teams would do well to have the right evals to make this decision rather than get surprised.

Also, the “if you can afford it” part can be a fairly non-trivial decision.


The vast majority of AI startups will fail for reasons other than model costs. If you crack your use case, model costs should fall exponentially.


Yeah, totally agree. I see so many systems perform badly, only to find out they're using an older-generation model, and simply updating to the current model fixes many of their issues.


Quality can drop drastically even moving from Model N to N+1 from the same provider, let alone a different one.

You'll have to adjust a bunch of prompts and measure. And if you didn't have a baseline to begin with, good luck YOLOing your way out of it.
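
A minimal sketch of what keeping that baseline could look like, assuming you already have per-task scores from an eval run (the file name, task names, and numbers are illustrative only):

    # Sketch: persist one baseline eval run so a model (or prompt) change can be
    # compared against recorded numbers rather than eyeballed.
    import json
    from pathlib import Path

    BASELINE_FILE = Path("eval_baseline.json")

    def save_baseline(model_name: str, scores: dict) -> None:
        BASELINE_FILE.write_text(json.dumps({"model": model_name, "scores": scores}, indent=2))

    def compare_to_baseline(scores: dict) -> None:
        baseline = json.loads(BASELINE_FILE.read_text())
        for task, new_score in scores.items():
            old_score = baseline["scores"].get(task)
            delta = "n/a" if old_score is None else f"{new_score - old_score:+.2f}"
            print(f"{task:12s} new={new_score:.2f} baseline={old_score} delta={delta}")

    # Example: record the current model's numbers, then check a candidate against them.
    save_baseline("model-n", {"triage": 0.82, "root_cause": 0.61})
    compare_to_baseline({"triage": 0.79, "root_cause": 0.66})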





