mbh159's comments | Hacker News

The methodology debate in this thread is the most important part.

The commenter who says "add obfuscation and success drops to zero" is right, but that's also the wrong approach imo. The experiment isn't claiming AI can defeat a competent attacker. It's asking whether AI agents can replicate what a skilled reverse-engineering (RE) specialist does on an unobfuscated binary. That's a legitimate, deployable use case (internal audit, code review, legacy binary analysis) even if it doesn't cover adversarial-grade malware.

The more useful framing: what's the right threat model? If you're defending against script kiddies and automated tooling, AI-assisted RE might already be good enough. If you're defending against targeted attacks by people who know you're using AI detection, the bar is much higher and this test doesn't speak to it.

What would actually settle the "ready for production" question: run the same test with the weakest obfuscation that matters in real deployments (import hiding, string encoding), not adversarial-grade obfuscation. That's the boundary condition.
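To make that concrete: the lightest string-encoding transform is just a single-byte XOR over each literal, so it no longer shows up in a plain `strings` dump. A minimal sketch (the key and the API name here are illustrative, not from the article):

```python
KEY = 0x5A  # arbitrary single-byte key, chosen for illustration

def encode(s: str) -> bytes:
    """Encode a string so it doesn't appear in a naive string scan."""
    return bytes(b ^ KEY for b in s.encode())

def decode(blob: bytes) -> str:
    """Runtime decode, as the obfuscated binary would do just before use."""
    return bytes(b ^ KEY for b in blob).decode()

secret = encode("GetProcAddress")
assert b"GetProcAddress" not in secret   # invisible to `strings`-level tools
assert decode(secret) == "GetProcAddress"
```

Obfuscation this weak is common in real deployments, which is why it makes a better boundary condition than adversarial-grade packing.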


Why does that matter? Being oblivious to obfuscated binaries is like failing the captcha test.

Let's say instead of reversing, the job was to pick apples. Let's say an AI can pick all the apples in an orchard in normal weather conditions, but add overcast skies and success drops to zero. Is this, in your opinion, still a skilled apple picking specialist?


What if it’s 10x as fast during clear conditions? Then it doesn’t matter.

No hate. My only point is that it's easy for analogies to fail. I can't tell the point of either of your analogies, whereas the OP made several clear and cogent points.


Maybe not, but also maybe you would no longer need skilled apple picking specialists.

So cool. What's underappreciated imo: 17k tokens/sec doesn't just change deployment economics, it changes what evaluation means. Static MMLU-style tests were designed around human-paced interaction. At this throughput you can run tens of thousands of adversarial agent interactions in the time a standard benchmark takes. Speed doesn't make static evals better; it makes them even more obviously inadequate.
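Back-of-envelope on that claim (the tokens-per-interaction figure is my assumption, not a measurement; only the 17k tokens/sec is from the article):

```python
throughput = 17_000           # tokens/sec, the figure being discussed
interaction_tokens = 4_000    # assumed size of one multi-turn adversarial exchange
n_interactions = 10_000

total_tokens = n_interactions * interaction_tokens
wall_clock_hours = total_tokens / throughput / 3600
print(round(wall_clock_hours, 1))  # -> 0.7
```

Ten thousand multi-turn adversarial runs in well under an hour of wall-clock time, on a single stream, is a different evaluation regime than a static question set.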

The split here is between AI as amplifier vs. AI as replacement. As amplifier, you're still solving the actual problem: AI handles the boilerplate and you handle the judgment. As replacement, you lose the feedback loop that makes you better over time. The developers who thrive will be the ones who know which problems still require them to be in the loop. That's a skill that takes deliberate practice and intuition to develop, and almost no AI tooling is designed to teach it.

77.1% on ARC-AGI-2 and still can't stop adding drive-by refactors. ARC-AGI-2 tests novel pattern induction, it's genuinely hard to fake and the improvement is real. But it doesn't measure task scoping, instruction adherence, or knowing when to stop. Those are the capabilities practitioners actually need from a coding agent. We have excellent benchmarks for reasoning. We have almost nothing that measures reliability in agentic loops. That gap explains this thread.

The 8% one-shot / 50% unbounded injection numbers from the system card are more honest than most labs publish, and they highlight exactly why you can't evaluate safety with static tests. An attacker doesn't get one shot — they iterate. The right metric isn't "did it resist this prompt" but "how many attempts until it breaks." That's inherently an adversarial, multi-turn evaluation. Single-pass safety benchmarks are measuring the wrong thing for the same reason single-pass capability benchmarks are: real-world performance is sequential and adaptive.
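If you model attempts as independent with one-shot success rate p (a floor, since real attackers adapt between tries and do better), "attempts until break" falls straight out of the geometric distribution. A quick sketch using the 8% figure:

```python
import math

p = 0.08  # one-shot injection success rate from the system card

# Probability of at least one break within k independent attempts:
#   P(break <= k) = 1 - (1 - p)**k
# Attempts needed to reach 50% break odds:
k_for_half = math.log(0.5) / math.log(1 - p)
print(round(k_for_half, 1))  # -> 8.3
```

Under eight or nine tries to even odds, assuming no adaptation at all; with adaptation it's fewer. That's the number a single-pass benchmark never surfaces.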

This is the right direction to understanding AI capabilities. Static benchmarks let models memorize answers while a 300-turn Magic game with hidden information and sequencing decisions doesn't. The fact that frontier model ratings are "artificially low" because of tooling bugs is itself useful data: raw capability ≠ practical performance under real constraints. Curious whether you're seeing consistent skill gaps between models in specific phases (opening mulligan decisions vs. late-game combat math), or if the rankings are uniform across game stages.

A lot of models (including Opus) keep insisting in their reasoning traces that going first can be a bad idea for control decks, etc, which I find pretty interesting - my understanding is that the consensus among pros is closer to "you should go first 99.999% of the time", but the models seem to want there to be more nuance. Beyond that, most of the really interesting blunders that I've dug into have turned out to be problems with the tooling (either actual bugs, or MCP tools with affordances that are a poor fit for how LLMs assume they work). I'm hoping that I'm close to the end of those and am gonna start getting to the real limitations of the models soon.

Like you said, there's a lot of complexity in the decision making here. To get statistically significant results we need to run these simulations many times. We record latency, tool calls, token consumption, etc., as well as results. Since we log the actions and their final outcomes, we can later analyze how decisions correlate with success. Our hypothesis is that games provide an important benchmark for how these models will adapt as they become more capable.
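The post-hoc analysis step can be sketched like this (the log rows and the decision feature are made up for illustration; real logs would have many more fields):

```python
# Hypothetical game logs: one row per game, decision features + outcome.
games = [
    {"kept_risky_mulligan": True,  "won": False},
    {"kept_risky_mulligan": True,  "won": False},
    {"kept_risky_mulligan": False, "won": True},
    {"kept_risky_mulligan": False, "won": True},
    {"kept_risky_mulligan": False, "won": False},
]

def win_rate(rows):
    """Fraction of games won in a subset of the logs."""
    return sum(r["won"] for r in rows) / len(rows) if rows else 0.0

kept = [g for g in games if g["kept_risky_mulligan"]]
passed = [g for g in games if not g["kept_risky_mulligan"]]
print(win_rate(kept), round(win_rate(passed), 2))  # -> 0.0 0.67
```

With enough runs per condition, the same split-and-compare works for any logged decision, which is why recording everything up front matters.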

For example, I'm sure an RL bot will be able to figure out an optimal strategy over millions of simulations that defeats current LLMs with context; however, this may not always hold true.


I've been thinking about how we can better orchestrate the long-term planning logic in this benchmark too. Similar to how Claude Code has a planning step, maybe every X turns we introduce a planning calibration step, much like how people are able to plan multi-step turns.

I.e. we often see the same logic repeat: "Turn 70: I have 4 cities with 24 military units and 3 workers. Critical issues: Roma and Antium are flagged as undefended. I see phalanx #160 at Roma (10,58) and phalanx #171 at Antium (13,59) - they need to fortify for defense."

"Turn 70: I have 4 cities with 24 military units and 3 workers. Critical issues: Roma and Antium are flagged as undefended. I see phalanx #160 at Roma (10,58) and phalanx #171 at Antium (13,59) - they need to fortify for defense. I have a massive army of warriors that should be

and just earlier "Turn 68: I have 4 cities, opponent location unknown. Critical: Southgate (7,60) is undefended - Phalanx #167 is at (7,60), so I need to fortify it there. I have 23 military units but no enemy sighted yet. Priority: 1) Garrison Southgate with phalanx #167, 2) Fortify defenders in cities, 3) ..."
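The calibration idea above can be sketched as a loop that only re-derives a standing plan every X turns, instead of re-stating the whole situation each turn (the agent API here is a placeholder, not any real framework):

```python
PLAN_EVERY = 10  # recalibrate the standing plan every 10 turns (assumed value)

class StubAgent:
    """Placeholder agent; a real one would call the model."""
    def __init__(self):
        self.plans = 0
        self.moves = 0
    def make_plan(self, turn):
        self.plans += 1
        return f"plan@{turn}"       # multi-turn plan derived from full state
    def take_turn(self, turn, plan):
        self.moves += 1             # per-turn action references the standing plan

def run_game(agent, turns):
    plan = None
    for turn in range(1, turns + 1):
        if plan is None or turn % PLAN_EVERY == 1:
            plan = agent.make_plan(turn)   # periodic planning calibration
        agent.take_turn(turn, plan)

a = StubAgent()
run_game(a, 70)
print(a.plans, a.moves)  # -> 7 70
```

Seven plan calls across 70 turns instead of 70, which is also where the repeated "Turn 70: I have 4 cities..." boilerplate in the traces would collapse into one standing plan.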


thanks for checking it out, let me know if there's other game environments you'd want to see!


polymarket market soon??

