
The most annoying thing in the LLM space is that people write articles and research with grand pronouncements based on old models. This article has no mention of Sonnet 4.5, nor does it use any of the actual OpenAI coding models (GPT-5-Codex, GPT-5.1-Codex, etc.), and given that, even the Opus data is likely from an older version.

This then leads to a million posts where one side says "yeah, see, they're crap" and the other asks "why did you use a model from 6 months ago for your 'test' and write-up in Jan 2026?".

You might as well ignore all of the articles and pronouncements and stick to your own lived experience.

The change in quality between 2024 and 2025 is gigantic. The change between early 2025 and late 2025 is _even_ larger.

The newer models DO let you know when something is impossible or unlikely to solve your problem.

Ultimately, they are designed to obey. If you authoritatively request bad design, they're going to write bad code.

I don't think this is a "you're holding it wrong" argument. I think it's "you're complaining about iOS 6 and we're on iOS 12.".


That's one take, and certainly a possible (if negative) one, but I think people create libraries for different reasons.

There are people who will use AI (paying trivial costs out of their own pocket) to build a library and maintain it simply out of passion, ego, and perhaps some technical clout.

That's the same with OSS libraries in general. Some are maintained at cost, others are run like a business where the founders try to break even.


LLMs like Opus, Gemini 3, and GPT-5.2/5.1-Codex-Max are phenomenal for coding and have only very recently crossed the gap between being "eh" and being quite fantastic to let operate on their own agentically. The major trade-off is cost: I ran up $200 per provider after running through 'pro' tier limits during a single week of hacking over the holidays.

Unfortunately, it's still surprisingly easy for these models to fall into really stupid maintainability traps.

For instance, today Opus added a feature that needed access to a db. It failed because the db (sqlite) wasn't local to the executable at runtime. Its solution was to write a 100-line function to resolve a relative path and handle every error and variation.

I hit ESC and said "... just accept a flag for --localdb <file>". It responded with "oh, that's a much cleaner implementation. Good idea!", implemented my approach, and deleted all the hacks it had scattered about.
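For the curious, the difference is roughly a pile of path-resolution heuristics versus something shaped like the sketch below. The original feature wasn't in Python, and everything beyond the --localdb flag name (which comes from my prompt) is illustrative:

    import argparse
    import sqlite3
    import sys

    def main() -> None:
        # Take the database location explicitly instead of guessing
        # where it lives relative to the executable.
        parser = argparse.ArgumentParser()
        parser.add_argument("--localdb", required=True, metavar="FILE",
                            help="path to the local sqlite database")
        args = parser.parse_args()

        try:
            # mode=rw refuses to silently create an empty db at a wrong path.
            conn = sqlite3.connect(f"file:{args.localdb}?mode=rw", uri=True)
        except sqlite3.OperationalError as exc:
            sys.exit(f"could not open database at {args.localdb}: {exc}")

        # ... the actual feature uses `conn` from here on ...
        conn.close()

    if __name__ == "__main__":
        main()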

This... is why LLMs are still not Senior engineers. They do plainly stupid things. They're still absurdly powerful and helpful, but if you want maintainable code you really have to pay attention.

Another common failure is when context is polluted.

I asked Opus to implement a feature by looking up the spec. It looked up the wrong spec (a v2 API instead of v3) -- I had only said "latest spec". It then did the classic LLM circular troubleshooting as we went around in four loops trying to figure out why calculations were failing.

I killed the session, asked a fresh instance to "figure out why the calculation was failing", and it found it straight away. The previous instance would have gone in circles for eternity because its worldview had been polluted by the assumptions it had made -- assumptions that could not be shaken.

This is a second way in which LLMs are rigid and robotic in their thinking and approach -- taking the wrong path even when directed not to. Further reading on 'debugging decay': https://arxiv.org/abs/2506.18403

All this said, the number of failure scenarios gets ever smaller. We've gone from "problem and hallucination every other code block" to "problem every 200-1000 code blocks".

They're now in the sweet spot of acting as a massive accelerator. If you're not using them, you'll simply deliver slower.


This is a cringe comment from the era when "Micro$oft" was hip, and it reads like you're a fanboi for Anthropic/Google foaming at the mouth.

It would be far more useful if you provided actual verifiable information and dropped the cringe memes. I can't take seriously someone who uses "Microslop" in a sentence.


I noticed a significant improvement, to the point where I made it my go-to model for questions. Coding-wise, not so much. As an intelligent model for writing up designs, investigations, and general exploration/research tasks, it's top notch.


This is one of those areas where I think it's about the complexity of the task. What I mean is: if you set Codex to xhigh by default, you're wasting compute. If you set it to xhigh when troubleshooting a complex memory bug or something, you're presumably more likely to get a quality response.

I think, in general, medium ends up being the best all-purpose setting, while high+ is good for single-task deep dives. Or at least that has been my experience so far. You can theoretically let it work longer on a harder task as well.

A lot appears to depend on the problem and problem domain unfortunately.

I've used max on problems as diverse as troubleshooting Cyberpunk mods and figuring out a race condition in a server backend. In those cases, it did a pretty good job of exhausting the available data (finding all available logs, digging into Lua files) and narrowing down a bug that every other model failed to get.

I guess in some sense you have to know from the outset that it's a "hard problem". That in and of itself is subjective.


You should also be making handoffs to/from Pro


What does this add to the conversation? This isn't Reddit.


This is completely the wrong way to do this. I say this as someone who works in this area, leveraging LLMs to a limited degree in trading.

LLMs are naive, easily convinced, and myopic. They're also non-deterministic. We have no way of knowing whether, if you ran this little experiment 10 times, they'd all pick something else. This is scattershot plus luck.

The RIGHT way to do this is to first solve the underlying problem deterministically. That is, you first write a trading algorithm that's been thoroughly tested. THEN you can surface metadata to the LLM and say something along the lines of "given this data, plus data you pull from the web, make your trade decision for this time period and provide justification".
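A minimal sketch of the separation I mean, with hypothetical names and a stand-in momentum rule (the point is the structure, not the strategy):

    from dataclasses import dataclass

    @dataclass
    class Signal:
        ticker: str
        action: str      # "buy" | "sell" | "hold", decided deterministically
        momentum: float  # example metadata surfaced to the LLM

    def deterministic_strategy(prices: dict[str, list[float]]) -> list[Signal]:
        """The tested, reproducible part: same inputs, same outputs."""
        signals = []
        for ticker, series in prices.items():
            momentum = (series[-1] - series[0]) / series[0]
            action = "buy" if momentum > 0.02 else "hold"
            signals.append(Signal(ticker, action, momentum))
        return signals

    def build_llm_prompt(signals: list[Signal]) -> str:
        """The LLM layer only sees metadata the strategy already produced."""
        lines = [f"{s.ticker}: {s.action} (momentum={s.momentum:.2%})" for s in signals]
        return (
            "Given these pre-computed signals, plus data you pull from the web, "
            "make your trade decision for this time period and provide justification:\n"
            + "\n".join(lines)
        )

Everything the LLM sees has already passed through the deterministic layer, so the baseline stays backtestable and auditable.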

Honestly, adding LLMs directly to any trading pipeline just adds non-useful non-deterministic behavior.

The main value is the speed of wiring up something like sentiment analysis as a value-add or algorithmic supplement. Even this should be done with proper ML, but I see the most value in using LLMs to shortcut ML work that would otherwise require time/money/compute. You're trading value now for value later (the ML model would ultimately run cheaper long-run but take longer to get into prod).

This experiment, like most "I used AI to trade" blogs, is completely naive in its approach. It's taking the lowest possible hanging fruit. Worse still when those results are just a rising tide lifting all boats.

Edit (was a bit harsh): This experiment is an example of the kind of embarrassingly obvious things people try with LLMs, without understanding the domain, and then write up. To an outsider it can sound exciting. To an insider it's like seeing a news story claiming "LLMs are designing new CPUs!". No, they're not. A more useful bit of research would be to control for the various variables (sector exposure, etc.), then run it 10,000 times and report back on how LLM A skews towards always buying tech and LLM B skews towards always recommending safe stocks.

Alternatively, if they showed the LLM taking a step back and saying "ah, let me design a quant algo to select the best stocks" -- and then succeeding -- I'd be impressed. I'd also know it learned that from every quant who has had AI double-check their calculations/models/Python... but that's a different point.


I have to agree with you, but I'll remain a skeptic until the preview tag is dropped. I found Gemini 2.5 Pro to be AMAZING during preview, and then its performance and quality unceremoniously dropped month after month once it went live. Optimizations in favor of speed/cost, no doubt, but it soured me on jumping ship during a preview.

Anthropic pulled something similar with 3.6 initially: a preview with massive token output and then a real release with barely half -- which significantly curtails certain use cases.

That said, to date, Gemini has outperformed GPT-5 and GPT-5.1 on any task I've thrown at them together. Too bad Gemini CLI is still barely useful and prone to the same infinite-loop issues that have plagued it for over a year.

I think Google has genuinely released a preview of a model that leapfrogs all other models. I want to see if that is what actually makes it to production before I change anything major in my workflows.


Did you wait a while before downloading? The links it provides for temporary projects have a surprisingly brief window in which you can download them. I've had a similar experience even when waiting just a minute to download the file.

