marsh_mellow's comments

marsh_mellow · 2025-04-14T17:57:46 1744653466

From OpenAI's announcement:

> Qodo tested GPT‑4.1 head-to-head against Claude Sonnet 3.7 on generating high-quality code reviews from GitHub pull requests. Across 200 real-world pull requests with the same prompts and conditions, they found that GPT‑4.1 produced the better suggestion in 55% of cases. Notably, they found that GPT‑4.1 excels at both precision (knowing when not to make suggestions) and comprehensiveness (providing thorough analysis when warranted).

https://www.qodo.ai/blog/benchmarked-gpt-4-1/

arvindh-manian · 2025-04-14T17:59:57 1744653597

Interesting link. Worth noting that the pull requests were judged by o3-mini. Further, I'm not sure that 55% vs 45% is a huge difference.

marsh_mellow · 2025-04-14T18:04:02 1744653842

Good point. They said they validated the results by testing with other models (including Claude), as well as with manual sanity checks.

55% to 45% definitely isn't a blowout but it is meaningful — in terms of ELO it equates to about a 36 point difference. So not in a different league but definitely a clear edge

servercobra · 2025-04-15T15:05:49 1744729549

Maybe not as much to us, but for people building these tools, 4.1 being significantly cheaper than Clause 3.7 is a huge difference.

elAhmo · 2025-04-15T08:07:32 1744704452

I first read it as 55% better, which sounds significantly higher than ~22% which they report here. Sounds misleading.

jsnell · 2025-04-14T18:24:34 1744655074

That's not a lot of samples for such a small effect, I don't think it's statistically significant (p-value of around 10%).

swyx · 2025-04-14T18:47:57 1744656477

is there a shorthand/heuristic to calculate pvalue given n samples and effect size?

tedsanders · 2025-04-14T23:25:42 1744673142

There are no great shorthands, but here are a few rules of thumb I use:

- for N=100, worst case standard error of the mean is ~5% (it shrinks parabolically the further p gets from 50%)

- multiply by ~2 to go from standard error of the mean to 95% confidence interval

- scale sample size by sqrt(N)

So:

- N=100: +/- 10%

- N=1000: +/- 3%

- N=10000: +/- 1%

(And if comparing two independent distributions, multiply by sqrt(2). But if they’re measured on the same problems, then instead multiply by between 1 and sqrt(2) to account for them finding the same easy problems easy and hard problems hard - aka positive covariance.)

marsh_mellow · 2025-04-14T19:17:57 1744658277

p-value of 7.9% — so very close to statistical significance.

the p-value for GPT-4.1 having a win rate of at least 49% is 4.92%, so we can say conclusively that GPT-4.1 is at least (essentially) evenly matched with Claude Sonnet 3.7, if not better.

Given that Claude Sonnet 3.7 has been generally considered to be the best (non-reasoning) model for coding, and given that GPT-4.1 is substantially cheaper ($2/million input, $8/million output vs. $3/million input, $15/million output), I think it's safe to say that this is significant news, although not a game changer

jsnell · 2025-04-14T19:38:10 1744659490

I make it 8.9% with a binomial test[0]. I rounded that to 10%, because any more precision than that was not justified.

Specifically, the results from the blog post are impossible: with 200 samples, you can't possibly have the claimed 54.9/45.1 split of binary outcomes. Either they didn't actually make 200 tests but some other number, they didn't actually get the results they reported, or they did some kind of undocumented data munging like excluding all tied results. In any case, the uncertainty about the input data is larger than the uncertainty from the rounding.

[0] In R, binom.test(110, 200, 0.5, alternative="greater")

jacobsenscott · 2025-04-15T00:00:58 1744675258

That's a marketing page for something called qodo that sells ai code reviews. At no point were the ai code reviews judged by competent engineers. It is just ai generated trash all the way down.

InkCanon · 2025-04-14T18:01:18 1744653678

>4.1 Was better in 55% of cases

Um, isn't that just a fancy way of saying it is slightly better

>Score of 6.81 against 6.66

So very slightly better

wiz21c · 2025-04-14T18:22:14 1744654934

"they found that GPT‑4.1 excels at both precision..."

They didn't say it is better than Claude at precision etc. Just that it excels.

Unfortunately, AI has still not concluded that manipulations by the marketing dept is a plague...

kevmo314 · 2025-04-14T18:05:32 1744653932

A great way to upsell 2% better! I should start doing that.

neuroelectron · 2025-04-14T22:18:35 1744669115

Good marketing if you're selling a discount all purpose cleaner, not so much for an API.

marsh_mellow · 2025-04-14T18:14:30 1744654470

I don't think the absolute score means much — judge models have a tendency to score around 7/10 lol

55% vs. 45% equates to about a 36 point difference in ELO. in chess that would be two players in the same league but one with a clear edge

kevmo314 · 2025-04-14T18:19:42 1744654782

Rarely are two models put head-to-head though. If Claude Sonnet 3.7 isn't able to generate a good PR review (for whatever reason), a 2% better review isn't all that strong of a value proposition.

swyx · 2025-04-14T21:02:21 1744664541

the point is oai is saying they have a viable Claude Sonnet competitor now

marsh_mellow · on Oct 24, 2024

Very cool! Could this work for detecting nearby drones?

RamboRogers · on Oct 24, 2024

I would think it would work really well for that usage case. You could tune the antenna to focus on the in use bands. The automatic baselining solves a lot of this.

marsh_mellow · on Oct 22, 2024

Anthropic blog post outlining the research process: https://www.anthropic.com/news/developing-computer-use

Computer use API documentation: https://docs.anthropic.com/en/docs/build-with-claude/compute...

Computer Use Demo: https://github.com/anthropics/anthropic-quickstarts/tree/mai...

distalx · on Oct 26, 2024

On their "Developing a computer use model" post they have mention > On one evaluation created to test developers’ attempts to have models use computers, OSWorld, Claude currently gets 14.9%. That’s nowhere near human-level skill (which is generally 70-75%), but it’s far higher than the 7.7% obtained by the next-best AI model in the same category.

Here, "next-best AI model in the same category" referes to which model.

karpatic · on Oct 22, 2024

This needs to be brought up. Was looking for the demo and ended up on the contact form

frankdenbow · on Oct 22, 2024

Thanks for these. Wonder how many people will use this at work to pretend that they are doing work while they listen to a podcast.

nwnwhwje · on Oct 23, 2024

This is cover for the people whose screens are recorded. Run this on the monitorred laptop to make you look busy then do the actual work on laptop 2, some of which might actually require thinking so no mouse movements.

marsh_mellow · on Sept 11, 2024

To tag on to this, what are the most useful capabilities besides code generation?

marsh_mellow · on Sept 11, 2024

This is great. Is there any work being done to make something similar part of the browser API?

troupo · on Sept 11, 2024

There was the Do Not Track header that this great industry of ours immediately used to track users

marsh_mellow · on Sept 10, 2024

There's an open source version of this as well: https://github.com/Codium-ai/pr-agent

marsh_mellow · on Sept 10, 2024

They list seven different use cases in this technical blog:

https://harrison.ai/news/reimagining-medical-ai-with-the-mos...

I'd interpret it as a foundation model in the radiology domain