mohsen1's comments | Hacker News

O3 Pro could solve and prove the first problem when I tried:

https://chatgpt.com/s/t_687be6c1c1b88191b10bfa7eb1f37c07


I’ve been exploring this too, since I rely on LLMs a lot to build software. I’ve noticed that our dev loop (writing, testing) is often mostly human-guided, but language models frequently outperform us in reasoning. If we plug in more automation (MCP tools controlling browsers, documentation readers, requirement analysers), we can make the cycle much more automated, with less human involvement.

This article suggests scaling up RL by exposing models to thousands of environments.

I think we can already achieve something similar by chaining multiple agents:

1. A “requirement” agent that uses browser tools to craft detailed specs from docs.

2. A coding agent that sets up environments (Docker, build tools) via browser or CLI.

3. A testing agent that validates code against specs, again through tooling.

4. A feedback loop where the tester guides the coder based on results.

Put together, this system becomes a fully autonomous development pipeline, especially for small projects. In practice, I’ve left my machine running overnight, and these agents propose new features, implement them, run tests, and push to the repo once they pass. It works surprisingly well.
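A rough sketch of the loop, assuming each agent is exposed as an async function (the names and shapes here are illustrative, not a real API):

    // Hypothetical agent interfaces; names are illustrative, not a real API.
    type Spec = { title: string; requirements: string[] };
    type TestReport = { passed: boolean; failures: string[] };

    interface Agents {
      requirements(docsUrl: string): Promise<Spec>;         // 1. browse docs, write the spec
      code(spec: Spec, feedback: string[]): Promise<void>;  // 2. set up the env, edit the repo
      test(spec: Spec): Promise<TestReport>;                // 3. validate code against the spec
      push(message: string): Promise<void>;
    }

    async function devLoop(agents: Agents, docsUrl: string, maxIters = 10): Promise<boolean> {
      const spec = await agents.requirements(docsUrl);
      let feedback: string[] = [];
      for (let i = 0; i < maxIters; i++) {
        await agents.code(spec, feedback);          // implement or revise
        const report = await agents.test(spec);     // run tests against the spec
        if (report.passed) {
          await agents.push(`feat: ${spec.title}`); // only push once tests pass
          return true;
        }
        feedback = report.failures;                 // 4. tester feedback guides the coder
      }
      return false;                                 // give up and hand back to a human
    }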

The main barrier is cost—spinning up many powerful models is expensive. But on a modest scale, this method is remarkably effective.


> The main barrier is cost

I very much disagree. For the larger, more sophisticated stuff that runs our world, it is not cost that prohibits wide and deep automation. It's deeply sophisticated and constrained requirements, highly complex existing behaviors that may or may not be able to change, systems of people who don't always hold the information needed, usually wildly out of date internal docs that describe the system or even how to develop for it, and so on.

Agents are nowhere near capable of replacing this, and even if they were, they'd change it differently in ways that are often undesirable or illegal. I get that there's this fascination with "imagine if it were good enough to..." but it's not, and the systems AI must exist in are both vast and highly difficult to navigate.


The status quo system you describe isn't objectively optimal. It sounds archaic to me. "We" would never intentionally design it this way if we had a fresh start. I believe it is this way due to a myriad of reasons, mostly stemming from the frailty and avarice of people.

I'd argue the opposite of your stance: we've never had a chance at a fresh start without destruction, but agents (or their near-future offspring) can hold our entire systems "in memory", and therefore might be our only chance at a redo without literally killing ourselves to get there.


It's not claimed to be an "objectively optimal" solution, it's claimed to represent how the world works.

I don't know where you're going with discussion of destruction and killing, but even fairly simple consumer products have any number of edge cases that initial specifications rarely capture. I'm not sure what "objectively optimal" is supposed to mean here, either.

If a spec described every edge case it would basically be executable already.

The pain of developing software at scale is that you're creating the blueprint on the fly from high-level vague directions.

Something trivial that nevertheless often results in meetings and debate in the development world:

Spec requirement 1: "Give new users a 10% discount, but only if they haven't purchased in the last year."

Spec requirement 2, a year later: "Now offer a second product the user can purchase."

Does the 10% discount apply to the second product too? Do you get the 10% discount on the second product if you purchased the first product in the last year, or does a purchase on any product consume the discount eligibility? What if the prices are very different and customers would be pissed off if a $1 discount on the cheaper product (which didn't meet their needs in the end) prevented them from getting a $10 discount 9 months later (which they think will)? What if the second product is a superset of the first product? What if there are different relevant laws in different jurisdictions where you're selling your product?
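To make the ambiguity concrete, here's a hypothetical sketch of the eligibility check (the names and shapes are mine, not from any spec); every commented question is a product decision the spec never made:

    // Hypothetical sketch; the commented questions are decisions the spec never made.
    interface Purchase { productId: string; amount: number; date: Date }

    function discountEligible(history: Purchase[], productId: string, now: Date): boolean {
      const oneYearAgo = new Date(now.getTime() - 365 * 24 * 60 * 60 * 1000);
      const recentPurchases = history.filter(p => p.date > oneYearAgo);

      // Does a purchase of *any* product consume eligibility, or only the same product?
      const consumed = recentPurchases.some(p => p.productId === productId);
      // Should a $1 discount on the cheap product block a $10 discount on the expensive one later?
      // Does buying the superset product count as having bought the subset product?
      // Which jurisdiction's rules apply to this customer?
      return !consumed;
    }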

Agents aren't going to figure out the intent of the company's principals automatically here, because the decision maker doesn't even realize it's a question until the implementers get into the weeds.

A sufficiently advanced agent would present all the options to the person running the task, and then the humans could decide. But then you've slowed things back down to the pace of the human decision makers.

The complexities only increase as the product grows. And once you get into distributed or concurrent systems even most of our code today is ambiguous enough about intent that bugs are common.


Agents quite literally cannot do this today.

Additionally, I disagree with your point:

> The status quo system you describe isn't objectively optimal.

On the basis that I would challenge you or anyone to judge what is objectively optimal. Google Search is a wildly complex system, an iceberg of rules on top of rules, specifically because it is digital infrastructure surrounding an organic system filled with a diverse group of people with ever-changing preferences and behaviors. What, exactly, would be optimal here?


"deeply sophisticated and constrained requirements"

Yes, this resonates completely. I think many are forgetting that formal languages and code exist precisely because natural language has so much ambiguity that it can't capture complex behavior.

LLMs are great at interpolating between implicit and unsaid requirements but whether their interpolation matches your mental model is a dice throw


> they'd change it differently in ways that are often undesirable or illegal.

So...like SAP then?


Overall, I agree - it would take far more sophisticated and deterministic or 'logical' AI better capable of tracking constraints, knowing what to check and double check, etc... Right now, AI is far too scattered to pull that off (or, for the stuff that isn't scattered, it's largely just incapable), but a lot of smart people are thinking about it.

Imagine if...nevermind.


> but language models frequently outperform us in reasoning

what

99% of the time their reasoning is laughable. Or even if their reasoning is on the right track, they often just ignore it in the final answer, and do the stupid thing anyway.


There are 2 kinds of people. Those who are outperformed on their most common tasks by LLMs and those who aren’t.

There are also two kinds of people - those who are excited by that and those who are not.

The result is a 2x2 matrix where several quadrants are deeply concerning to me.


There are also two kinds of people - those who are objective enough to tell when it happens and those who will never even see when they’re outperformed because of their cognitive biases.

I give you a 2x2x2 matrix.


Sure, but if a person can find an easier way to do their job, they’ll usually do it. Usually the bias is towards less energy expenditure.

For many people, yes. For people who have their identity invested in being the smartest person in the room, life is considerably harder.

I'm sure if we work hard enough we can add a meta-meta-cognition level. Cognition is just a 2^n series of binary states, right?

> I give you a 2x2x2 matrix.

That'd be a tensor, no?


A rank-3 tensor, yes. Matrices are rank-2 tensors.

God I hated tensors in grad school. Give me a Taylor series any day.

Everything that seems complicated is just a "fancy matrix".

Which quadrant is NOT concerning to you?

The best part is when a “thinking” model carefully thinks and then says something that is obviously illogical, when the model clearly has both the knowledge and context to know it’s wrong. And then you ask it to double check and you give it a tiny hint about how it’s wrong, and it profusely apologizes, compliments you on your wisdom, and then says something else dumb.

I fully believe that LLMs encode enormous amounts of knowledge (some of which is even correct, and much of which their operator does not personally possess), are capable of working quickly and ingesting large amounts of data, and have essentially no judgment or particularly strong intelligence of the non-memorized sort. This can still be very valuable!

Maybe this will change over the next few years, and maybe it won’t. I’m not at all convinced that scraping the bottom of the barrel for more billions and trillions of low-quality training tokens will help much.


I feel like one coding benchmark should just be repeatedly telling it to double-check or fix something that's actually perfectly fine, and watching how badly it deep-fries your code base.

The key difference between that and humans, of course, is that most humans will double down on their error and insist that your correction is wrong, throwing a kitchen sink of appeals to authority, motte-and-bailey arguments, and other rhetorical techniques at you.

That's not any different in practice from the LLM "apologising" to placate you and then making a similar mistake again.

It's not even a different strategy. It's just using rhetoric in a more limited way, and without human emotion.

These are style over substance machines. Their cognitive abilities are extremely ragged and unreliable - sometimes brilliant, sometimes useless, sometimes wrong.

But we give them the benefit of the doubt because they hide behind grammatically correct sentences that appear to make sense, and we're primed to assume that language = sentience = intelligence.


True "interruption" requires continuous learning; the current model is essentially a dead frog, and frozen weights cannot be truly grounded in real time.

https://news.ycombinator.com/item?id=44488126


Yea, I don't understand how people are "leaving it running overnight" to successfully implement features. There just seems to be a large disconnect between people who are all in on AI development and those who aren't. I have a suspicion that the former are using Python/JS and the features they are implementing are simple CRUD APIs, while the latter are working with more complex systems/languages.

I think the problem is that despite feeding it all the context and having all the right MCP agents hooked up, there isn't a human in the loop. So it will just reason against itself, causing these laughably stupid decisions. For simple boilerplate tasks this isn't a problem, but as soon as the scope is outside of a CRUD/boilerplate problem, the whole thing crumbles.


I'd really like to know which use cases work and which don't. And when folks say they use agentic AI to churn through tokens to automate virtually the entire SDLC, are they just cherry picking the situations that turned out well, or do they really have prompting and workflow approaches that indeed increase their productivity 10-fold? Or, as you mention, is it possibly a niche area which works well?

My personal experience the past five months has been very mixed. If I "let 'er rip" it's mostly junk I need to refactor or redo by micro-managing the AI. At the moment, at least for what I do, AI is like a fantastic calculator that speeds up your work, but where you still should be pushing the buttons.


Or - crazy idea here - they're just full of it.

I haven't seen an LLM stay on task anywhere near that long, like...ever. The only thing that works better left running overnight that has anything to do with ML, in my experience, is training.


Yes, if an LLM outperforms you, you have never reasoned in your life.

I will assume you passed high-school based on your looks and not on your abilities.


> 99% of the time their reasoning is laughable. Or even if their reasoning is on the right track, they often just ignore it in the final answer, and do the stupid thing anyway.

99% chance you're using the wrong model.

Effective tool use is a valuable skill, arguably the only one that still matters.


RL is a training method and it improves the model itself. So basically one step (e.g. a successful test run, finding a search result) could create positive and negative examples for the other step (e.g. the coding agent, the search agent). Using this, the base model itself will improve to satisfy other demands, and if it reaches close to 100% accuracy (which I believe it could, as models mostly fail due to dumb mistakes in tests), you don't need the testing agent at all.
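Roughly, the idea is to turn one step's outcome into a reward label for the step that produced it; a hypothetical sketch of the data you'd collect before fine-tuning:

    // Hypothetical sketch of turning downstream outcomes into RL training examples.
    interface Trajectory { prompt: string; response: string }
    interface TrainingExample extends Trajectory { reward: number }

    // e.g. the testing agent's verdict becomes the reward for the coding agent's output
    function labelFromOutcome(step: Trajectory, testsPassed: boolean): TrainingExample {
      return { ...step, reward: testsPassed ? 1 : -1 };
    }

    // Accumulate these across many environments/runs, then update the base model on them.
    const dataset: TrainingExample[] = [];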

Neat! And a minimum delay can be done with Promise.race.


Robotics data is perhaps not sensor and motor activation data; it's just video of things happening. A good model doesn't need that sort of data to be good at world modeling.


While you're not wrong, I suspect that people focusing on this will lead to slower robotics development.

It's in a similar vein to how "you can prove that a single-layer neural network exists that does the same thing as a combination of many neural networks" led a lot of people to focus on single-layer NNs for "purity's sake", which contributed to an AI winter.

Like yeah, maybe you don't "need" sensor and motor data. Especially if you build what you're calling a "good" model.

But making a "good" model that gets results might be near impossible and building a "less good" model that does use sensor data, and performs far better on real tasks, might be way easier for us mere mortals to do.


One of the viewpoints covered in the article: "Another set of people argued that we can leverage existing vision, language, and video data and then just ‘sprinkle in’ some robotics data."


I've had success using BrowserMCP

https://browsermcp.io

It really feels magical when the AI agent can browse and click around to understand the problem at hand

Also, sometimes an interactive command can stop agents from doing things. I wrote a small wrapper that always returns, so agents never get stuck waiting.

https://github.com/mohsen1/agentshell
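The rough idea (a hypothetical sketch, not the actual agentshell code): spawn the command with stdin closed and a hard timeout, so it always returns instead of sitting at an interactive prompt.

    // Hypothetical sketch, not the actual agentshell implementation:
    // run a command with stdin closed and a hard timeout so it always returns.
    import { spawn } from "node:child_process";

    function runNonInteractive(cmd: string, args: string[], timeoutMs = 60_000) {
      return new Promise<{ code: number | null; output: string }>(resolve => {
        const child = spawn(cmd, args, { stdio: ["ignore", "pipe", "pipe"] }); // stdin closed
        let output = "";
        child.stdout?.on("data", d => (output += d));
        child.stderr?.on("data", d => (output += d));
        const timer = setTimeout(() => child.kill("SIGKILL"), timeoutMs);      // never hang forever
        child.on("close", code => {
          clearTimeout(timer);
          resolve({ code, output }); // always resolve, never reject, so the agent keeps going
        });
      });
    }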


مشک آن است که خود ببوید نه آنکه عطار بگوید

“Musk is that which smells by itself, not what the perfumer says (about it).”

This line is from Saadi Shirazi, the classical Persian poet, and it has become a proverb in the Persian-speaking world. Reviews at this point are what the seller wants you to read.

As long as Amazon is both the seller and the host of the reviews, there is no way to trust that Amazon would be fair in hosting those reviews.

The only way to know about a product is to read about it elsewhere, like the New York Times, which is not selling the product itself.


More often than not those sources are getting paid for product placement. Wirecutter, NYT, JD Power, Wired - they all get advertising money for "reviews".


Of course. But they have a reputational stake in their recommendation as well.

I take a Wirecutter top pick as meaning something very different from Bobby123's glowing review. Wirecutter may have been influenced by ad money a bit. Bobby123 may not even exist or may be entirely driven by seller compensation. And I'll never see Bobby123 again.


Archive article


Correct article:

https://archive.is/ByOCT


Thank you both.

Unrelated note to publishing platforms: instead of using Archive to read your stuff, I'd rather pay for your content; but until there's a mechanism to send microtransactions to view individual articles one-off, I'm never going to sign up monthly (e.g. send 0.0001 XMR to address 1MADEup5649879846513216547, which then creates a machine-local cookie to allow viewing of content indefinitely or for a set time). Perhaps a GUI similar to the old RSS feeds, with a central CC/crypto processor that pays per article?


No one's interested in getting our 0.0001 whatever coin sadly. They're chasing the whales who are willing to commit to a 10 year subscription.


This could be a neat addition to substack's platform/design/profits.

----

My local newspaper (the father of NYT) expects its largely-aged population to shell out $35/month for daily prints. Some bozo thought offering a free iPad would really jive with this population ¯\_(ツ)_/¯


I'm in a gloomy mood.

Suppose someone comes up with a platform that allows us to pay $0.05 for one article. They count articles we've selected and charge us when it makes sense given credit card fees, probably at $10 worth of views.

How long will it take till they start pushing ... a subscription?

Pay us $10/month for $12 worth of article views! 20% savings! *

* Terms and conditions apply. Your $10/month subscription is only good for 3rd-tier content, but there are $29 and $99/month subscriptions that allow access to 2nd-tier and 1st-tier content. Price valid for a 3-year commitment.


>How long will it take till they start pushing ... a subscription?

I think this could be countered by having the platform take (e.g.) 10% of each transaction, which might disincentivize subscription models..?


No no. I mean the micropayment platform will start pushing a subscription to said micropayment platform after a while.


My ultimate goal is that authors/publishers get paid a little bit of something rather than a massive nothing (because the all-or-nothing subscribing to individual websites isn't happening).

I feel this issue echoes digital piracy, which I was happy to cease until there were dozens of individual platforms created. If you make it easier to not steal than to steal, fewer people will steal (is my premise, at least).

For background: I am a blue collar electrician, not a techie. I just want a simple method of paying for consumed news — without having to pay for access to an entire newspaper from which I only want to read ONE article.


To be a genuinely good Airbnb host and offer your home at a fair price, two things are usually required:

1. A well-furnished, comfortable home — which often comes with a higher income bracket.

2. The time, energy, and motivation to handle hosting duties — managing logistics, cleaning, communication, etc.

The problem is, those two qualities rarely overlap. People with high-quality homes often have demanding careers or other priorities and don’t want the uncertainty or effort of renting to strangers — especially when guest quality can vary a lot.

That’s partly why platforms like Kindred are interesting. They focus on home exchanges, not rentals, so everyone involved is both a guest and a host. That creates better alignment in expectations and care. There’s no pricing involved, no taxes, and a built-in sense of trust — since it’s a mutual exchange.

It’s not a perfect solution, but it shifts the dynamic in a way that feels more human and less transactional.

If you’re curious, here’s a referral link with 5 free credits: https://livekindred.com?invite_code=MOH.AZI


The iPad update is going to encourage a new wave of folks trying to use iPads for general programming. I'm curious how it goes this time around. I'm cautiously optimistic.


Isn't it still impossible to run any dev tools on the iPad?


IIRC Swift Playgrounds goes pretty deep -- a full LLVM compiler for Swift and you can use any platform API -- but you can't build something for distribution. The limitations are all at the Apple policy level.


Not quite. As another user mentioned, there's Swift Playgrounds which is complete enough that you can even upload apps made in it to the App Store. Aside from that, there are also IDEs like Pythonista for creating Python-based apps and others for Lua, JavaScript, etc. many of which come with their own frameworks for making native iOS/iPadOS interfaces.


I can assume that they are going to bring the Container stuff to iPad at some point. That would unlock so many things...


No vscode, no deal. I don't see that happening any time soon.


I think the story might actually be changing this time


You can’t run Docker on an iPad.


Very impressive numbers in those benchmarks. I'm trying it out right now to see how it performs in my cases. I switched to Opus 4, but I can easily be convinced to switch back to Gemini.

