haaz's comments

haaz · 2025-08-05T16:46:38 1754412398

it is barely an improvement according to their own benchmarks. not saying that's a bad thing, but not enough for anybody to notice any difference

waynenilsen · 2025-08-05T16:50:32 1754412632

i think it's probably mostly vibes but that still counts. this part is not in the charts:

> Windsurf reports Opus 4.1 delivers a one standard deviation improvement over Opus 4 on their junior developer benchmark, showing roughly the same performance leap as the jump from Sonnet 3.7 to Sonnet 4.

esafak · 2025-08-06T03:48:50 1754452130

That is a big improvement.

ttoinou · 2025-08-05T17:00:33 1754413233

That's why they named it 4.1 and not 4.5

zamadatix · 2025-08-05T17:35:23 1754415323

When it's "that's why they incremented the version by a tenth instead of a half" you know things have really started to slow for the large models.

phonon · 2025-08-05T17:46:12 1754415972

Opus 4 came out 10 weeks ago. So this is basically one new training run improvement.

zamadatix · 2025-08-05T18:11:34 1754417494

And in 52 weeks we've gone 3.5->4.1 with this training improvement, meanwhile the 52 weeks prior to that were Claude -> Claude 3. The absolute jumps per version delta also used to be larger.

I.e. it seems we don't get much more than new training run levels of improvement anymore. Which is better than nothing, but a shame compared to the early scaling.

globalise83 · 2025-08-05T20:06:35 1754424395

Is it really a bigger jump to go from plausible to frequently useful, than from frequently useful to indispensable?

zamadatix · 2025-08-05T21:22:52 1754428972

Why is there supposed to be no step between frequently useful and indispensable? Quickly going from nothing to frequently useful (which involved many rapid hops between) was certainly surprising, and that's precisely the lost momentum.

mclau157 · 2025-08-05T18:41:50 1754419310

They released this because competitors are releasing things

gloosx · 2025-08-05T18:19:48 1754417988

They need to leave some room to release 10 more models. They could crank benchmarks to 100% but then no new model is needed lol? Pretty sure these pretty benchmark graphs are all completely staged marketing numbers since they do solve the same problems they are being trained on – no novel or unknown problem is presented to them.

Topfi · 2025-08-05T20:29:41 1754425781

I am still very early, but output-quality-wise, yes, there does not seem to be any noticeable improvement in my limited personal testing suite. What I have noticed though is subjectively better adherence to instructions and documentation provided outside the main prompt, though I have no way to quantify or reliably test that yet. So beyond reliably finding Needles-in-the-Haystack (which Frontier models have done well on lately), Opus 4.1 seems to do better than Opus 4 at following those needles even when not explicitly guided to.

onlyrealcuzzo · 2025-08-05T20:03:33 1754424213

I will only add that it's interesting that in the results graphic, they simply highlighted Opus 4.1 - choosing not to display which models have the best scores - as Opus 4.1 only scored the best on about half of the benchmarks - and was worse than Opus 4.0 on at least one measure.

levocardia · 2025-08-05T19:33:37 1754422417

"You pay $20/mo for X, and now I'm giving you 1.05*X for the same price." Outrageous!

leetharris · 2025-08-05T17:13:38 1754414018

Good! I'm glad they are just giving us small updates. Opus 4 just came out, if you have small improvements, why not just release them? There's no downside for us.

AstroBen · 2025-08-05T17:15:37 1754414137

I don't think this could even be called an improvement? It's small enough that it could just be random chance

j_bum · 2025-08-05T17:31:50 1754415110

I’ve always wondered about this actually. My assumption is that they always “pick the best” result from these tests.

Instead, ideally they’d run the benchmark tests many times, and share all of the results so we could make statistical determinations.
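
As a rough illustration of the kind of statistical determination this would allow, here is a minimal sketch of an exact permutation test on repeated benchmark runs. Everything in it is an assumption for the sake of the example: the function name `permutation_p_value` is made up, and the scores are hypothetical numbers, not results published by any vendor.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(a, b):
    """Exact two-sided permutation test on the difference of means.

    Enumerates every way to split the pooled scores into groups of the
    original sizes and counts how often the split produces a gap at
    least as large as the observed one. Only practical for small samples.
    """
    pooled = list(a) + list(b)
    observed = abs(mean(a) - mean(b))
    n, k = len(pooled), len(a)
    extreme = 0
    total = 0
    for idx in combinations(range(n), k):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(n) if i not in chosen]
        # Small tolerance so ties with the observed gap are counted.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical scores from five repeated runs of each model (illustrative only).
old_runs = [72.1, 73.0, 72.6, 71.8, 72.9]
new_runs = [74.2, 74.9, 73.8, 74.5, 75.1]

p = permutation_p_value(old_runs, new_runs)
print(f"mean gap: {mean(new_runs) - mean(old_runs):.2f}, p-value: {p:.4f}")
```

With these made-up numbers every new run beats every old run, so only 2 of the 252 possible splits are as extreme as the observed one and the p-value is about 0.008. A single published score per model gives no way to run this kind of check, which is the point of asking for all the runs.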

haaz · on Dec 6, 2024

Public transport simply does not work at low densities, and improves with density. Car based societies are, almost by definition, not walkable. Dense, walkable cities are cheaper, healthier, prettier, greener, and more productive.

There are problems that come with density, like crime, but there are solutions to that other than low density. I encourage you to explore life in Asian cities, especially SE China or Singapore, to see how great low crime, high density cities can really be.

haaz · on Dec 6, 2024

When I worked at a large bank they blocked ChatGPT on the network. Unfortunately I was a new grad who didn’t know Java working in a Java team. I just turned off the vpn and copy pasted the code back and forth until it worked. Boss didn’t seem to mind. Left the job after 2 months anyway.

pwdisswordfishz · on Dec 6, 2024

> I was a new grad who didn’t know Java working in a Java team

All the more reason not to use ChatGPT then.

haaz · on Dec 3, 2024

Very cool, thanks for sharing. How did you train it? Just manually labeling the data?

asfarley · on Dec 3, 2024

I hired some workers in Bangladesh

haaz · on May 15, 2024

Seems similar to plotly dash, no?

louisjoejordan · on May 16, 2024

The biggest difference I see (though I'm not super familiar with Plotly) is that we define data transformations in SQL, while Plotly uses Python. One benefit of SQL is that it provides the advantage of tracing data lineage from source to visualization, which gives you visibility into data dependencies - something that Python code in Plotly Dash doesn't offer.

haaz · on March 30, 2024

Doubt it. If somebody attacked my dog I’d shoot them, but not if they attacked my hoover.

Teever · on March 30, 2024

This scenario has almost nothing to do with a scenario that includes armed law enforcement.

exe34 · on March 30, 2024

You sound too reasonable, you wouldn't be able to work in law enforcement.

haaz · on Feb 2, 2024

Great guide, especially for beginners/laymen.

gumby · on Feb 2, 2024

I consider these advanced splices. Someone fixing a power cord in their house need not go to such extremes.

onetimeuse92304 · on Feb 2, 2024

I always splice wires considering these good practices, regardless of what application. I just like doing things correctly and reliably.

I also have a little bit of a fetish. I really enjoy it when other people do their work correctly and do those small things that show their prowess. And I like to think somebody someday sees what I did and will enjoy it the same.

That's why, when I sail, I always have all my lines neatly tied and organised. And use proper knots for their applications. And look at other yachts when docked to see if their owners know what they are doing or not.

semi-extrinsic · on Feb 2, 2024

Do you knoll when building Lego?

onetimeuse92304 · on Feb 2, 2024

These days I leave Lego for my kids.

gottorf · on Feb 2, 2024

Just a reminder to all homeowners that the other extreme is also bad; you can't just wire-tape two Romex cables together in an attic and call it a day!

dylan604 · on Feb 2, 2024

I wish someone had told the person that did nearly this very thing to the low voltage control cable to my external AC compressor. At least in the attic it would not be exposed to weather and UV. <facepalmEmoji>

haaz · on Jan 11, 2024

Unbelievable, thanks so much. I’ll definitely use this for my job hunt.

barefootsanders · on Jan 12, 2024

Glad you like it! If you have any other suggestions or feedback, feel free to share. Aloha

haaz · on Jan 11, 2024

This is why we’ll never have an all in one wechat app in the west.

haaz · on Jan 11, 2024

Thank you for setting this up, it was incredibly helpful when I moved to Berlin