
This is actually really needed. Current AI design tools are so predictable and formulaic - every output feels like the same purple gradients with rounded corners and that one specific sans serif font that every model seems obsessed with. It's gotten to the point where you can spot AI-generated designs from a mile away because they all have this weird sterile aesthetic that screams "made by a model".


Exactly - we think the ai design tools are in the equivalent of the 'uncanny valley' territory that a lot of the diffusion models were stuck in just 1-2 months ago; most average diffusion models are still in this local optimum, but the best of the best seem to have escaped it.


I don't think this works right now tbh.

It has the same problem as LMArena (which already had webarena): better aesthetics are so far out of distribution you can't even train on the feedback you get here.

You just get a new form of turbo-slop as some hidden preference takes over. With text output that ended up being extensive markdown and emojis. Here that might be people accidentally associating frosted surfaces with relatively better aesthetics, for example.

The problem is so bad that LMArena maintains a separate ranking where they strip away styling entirely.


This is pretty cool and feels like we're heading in the right direction. The whole idea of being able to hop between devices while Claude Code is thinking through problems is neat, but honestly what excites me more is the broader pattern here: we're moving toward a world where coding isn't really about sitting down and grinding out syntax for hours, it's becoming more about organizing tasks and letting ai agents figure out the implementation details.

I can already see how this evolves into something where you're basically managing a team of specialized agents rather than doing the actual coding: you set up some high-level goals, maybe break them down into chunks, and then different agents pick up different pieces and coordinate with each other. The human becomes more like a project manager, making decisions when the agents get stuck or need direction. Imho tools like omnara are just the first step toward that - right now it's one agent that needs your input occasionally, but eventually it'll probably be orchestrating multiple agents working in parallel. Way better than sitting there watching progress bars for 10 minutes.


Exactly! My ideal vision for the future is that agents will be doing all grunt work/implementation, and we'll just be guiding them.

Can't wait til I'm coding on the beach (by managing a team of agents that notify me when they need me), but it might take a few more model releases before we get there lol


If you think you could do that on the beach, couldn't you do traditional software dev on the beach?

I actually think there's a chance it will shift away from that because it will shift the emphasis to fast feedback loops which means you are spending more of your time interacting with stakeholders, gathering feedback etc. Manual coding is more the sort of task you can do for hours on end without interruption ("at the beach").


> which means you are spending more of your time interacting with stakeholders, gathering feedback etc.

Jesus Christ, I really need to speed up development of my product. If this shifts to more meetings at wageslave, I’m going to kill myself.


How nice: you've just hung up with a demanding stakeholder who knows you can deliver a lot “instantly”, you switch to your phone, and your “agents” are just stuck on some weird stuff that they cannot debug.

That must be a nice situ on the beach.


What happens is the status quo changes. Like what happened with Dev/Ops. If you find yourself with the time to lead agents on a beach retreat you might find yourself pulled into more product design / management meetings instead. AI/Dev like DevOps. Wearing more hats as a result. Maybe I'm wrong though.


someone in leadership is also thinking about how he/she can lower head count by removing the agent master


I did exactly that all this summer at the beach with Claude code. Future is already here!


What will you have to offer when coding is so easy at that point?


I still think that human taste is important even if agents become really good at implementing everything and everyone's just an idea guy. Counter argument: if agents do become really good at implementation, then I'm not sure if even human taste would matter if agents could brute force every possibility and launch it into the market.

Maybe I'll just call it a day and chill with the fam


Seems like your vision is to let AI take over your livelihood. That’s an unusually chipper way to hand over the keys unless you have a lifetime of wealth stashed away.


There is enormous money and effort in making AI that can do that, so if it's possible it is eventually going to happen. The only question is whether you're part of the group making the replacement or the group being replaced.


It depends on what their livelihood is.

If their livelihood is solving difficult problems, and writing code is just the implementation detail they've gotta deal with, then this isn’t gonna do much to threaten their livelihood. Like, I am not aware of any serious SWE (who actually designs complex systems and implements them) being genuinely worried about their livelihood after trying out AI agents. If anything, that makes them feel more excited about their work.

But if someone’s just purely codemonkeying trivial stuff for their livelihood, then yeah, they should feel threatened. I have a feeling that this isn’t what the grandparent comment user does for a living tho.


Unfortunately, C-suites don't quite see eye to eye with your logical breakdown here, from my experience.


I neither know nor care what the C-suite at my company thinks, as long as they provide me the resources necessary to get my job done effectively.

And, so far, it seems like they are fairly understanding, as they are happy about the output of my work. After all, they aren't paying me per-line-of-code delivered, they are paying me to solve problems. If they think that an LLM can replace me fully, they are more than welcome to try it and see how it works out for them.

The entirety of my report chain is just former engineers (with some of them being pivotal to things like GMaps SDK for iOS and such), so I am not really worried about them testing this theory out in practice. And if they do and decide that an LLM can replace me, well, there are always other jobs out there I can take. From my personal experience at this company, I will be just fine.


> it's becoming more about organizing tasks and letting ai agents figure out the implementation details ... different agents pick up different pieces and coordinate with each other

This is exactly what I have been working on for the past year and a half. A system for managing agents where you get to work at a higher abstraction level, explaining (literally with your voice) the concepts & providing feedback. All the agent-agent-human communication is on a shared markdown tree.

I haven't posted it anywhere yet, but your comment just describes the vision too well - I guess it's time to start sharing it :D See https://voicetree.io for a demo video. I have been using it every day for engineering work, and it really does feel like how you describe; my job is now more about organizing tasks, explaining them well, and providing critique, but just through talking to the computer. For example, when going through the git diffs of what the agents wrote, I will speak out loud any problems I notice, resulting in voice -> text -> markdown tree updates, and these will send hook notifications to claude code so the agents automatically address the feedback.
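
To make that concrete, here's a stripped-down sketch of the shape of that loop (toy file layout and names, not the real implementation): transcribed feedback gets appended as a child node in the markdown tree, and a small inbox file is updated so a file-watching hook on the agent side knows there's new feedback to pick up.

    # feedback_loop.py - toy sketch of the voice -> markdown tree -> agent loop
    from pathlib import Path
    from datetime import datetime

    TREE_ROOT = Path("worktree")          # hypothetical layout: one .md file per node
    INBOX = TREE_ROOT / "agent_inbox.md"  # file an agent-side hook can watch

    def add_feedback(parent_node: str, transcript: str) -> Path:
        """Append spoken feedback as a child node and flag it for the agent."""
        TREE_ROOT.mkdir(exist_ok=True)
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        node = TREE_ROOT / f"{parent_node}--feedback-{stamp}.md"
        node.write_text(f"# Feedback on {parent_node}\n\n{transcript}\n")

        # link the new node from its parent so the tree stays navigable
        with (TREE_ROOT / f"{parent_node}.md").open("a") as f:
            f.write(f"\n- [feedback]({node.name})\n")

        # append to the inbox so a file watcher can notify the coding agent
        with INBOX.open("a") as f:
            f.write(f"- new feedback: {node.name}\n")
        return node

    add_feedback("auth-refactor", "The retry logic swallows the original exception; re-raise it.")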


Cool demo! The first thing that sprung to mind after seeing it, was an image of a busy office floor filled with people talking into their headsets, not selling or buying stocks, but actually programming. If it’s a blessed or cursed image I’ll let you decide.


Haha, blursed one might say. In seriousness though, the social awkwardness of talking to a computer around others will likely be the largest bottleneck to adoption for this sort of tech. It may need to be framed initially as being for work-from-home engineers.

Luckily the other side of this project doesn't require any user behavioural changes. The idea is to convert chat histories into a tree format with the same core algorithm, and then send only the relevant sub-tree to the LLM, reducing input tokens and context bloat, thereby also improving accuracy. This would then also unlock almost infinite-length LLM chats. I have been running this LLM context retrieval algo against a few benchmarks (GSM-infinite, nolima, and longbench-v2); the early results are very promising (~60-90% reduced tokens and increased accuracy against SOTA), however only on a subset of the full benchmark datasets.
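
The retrieval side is easiest to explain with a toy sketch (illustrative only, not the actual algorithm): score nodes against the current query, then send the best-matching node plus its ancestors and children rather than the whole history.

    # subtree_context.py - toy illustration of sub-tree context selection
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        title: str
        text: str
        parent: "Node | None" = None
        children: list["Node"] = field(default_factory=list)

    def attach(parent: Node, child: Node) -> Node:
        child.parent = parent
        parent.children.append(child)
        return child

    def score(node: Node, query: str) -> int:
        # crude relevance: count query words in the node (a real system would use embeddings)
        body = (node.title + " " + node.text).lower()
        return sum(w in body for w in query.lower().split())

    def relevant_subtree(root: Node, query: str) -> list[Node]:
        # flatten the tree, pick the best-scoring node,
        # then keep its ancestors (for grounding) and children (for detail)
        nodes, stack = [], [root]
        while stack:
            n = stack.pop()
            nodes.append(n)
            stack.extend(n.children)
        best = max(nodes, key=lambda n: score(n, query))
        ancestors = []
        cur = best.parent
        while cur:
            ancestors.append(cur)
            cur = cur.parent
        return list(reversed(ancestors)) + [best] + best.children

    def to_prompt(nodes: list[Node]) -> str:
        return "\n\n".join(f"## {n.title}\n{n.text}" for n in nodes)

Only the returned sub-tree gets serialized into the prompt, which is where the token reduction comes from.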


completed the form


> moving toward a world where coding isn't really about sitting down and grinding out syntax

Love the idea of "coding" while walking/running outside. For me those outside activities help me clear my mind and think about tough problems or higher level stuff. The thought of directing agents to help persist and refine fleeting thoughts/ideas/insights, flesh out design/code, etc is intriguing


I do a bit of that now, I'll mostly use Claude code at home, and set Jules on some tasks from my phone while exercising. Reviewing code is tedious though, and I don't see it getting too much better.


On the code review part, that's also because we are using languages designed for humans. Once we design programming languages for the LLM, we can design them in such a way that code review by humans and AI is easy.

Same with project org: if you organize the project for LLM efficiency instead of human efficiency, then you simplify some of the parts that the LLM has issues with.


Yeah exactly, this is awesome, I’ve always wondered while waiting for AI operations to complete why I’m “tied” to my machine and can’t just shut my laptop while it worked and see what it’d done later. This is so cool


But why should it take time at all? Newer developer tooling (especially some of the Rust tools, e.g. uv) is lightning fast.

Wouldn't it be better if you asked for it and, rather than having to manage workers, it was just... done?


Yes it would be good if we lived in a world where ai magically knew exactly what we wanted even before we did and implemented everything perfectly first time in a way we’d have no issues with or tweaks we’d like it to make ever. I agree.


Looking at this evaluation it's pretty fascinating how badly these models perform even on decades-old games that almost certainly have walkthroughs scattered all over their training data. Like, you'd think they'd at least brute force their way through the early game mechanics by now, but honestly this kinda validates something I've been thinking about: real intelligence isn't just about having seen the answers before, it's about being good at games and specifically new situations where you can't just pattern match your way out.

This is exactly why something like arc-agi-3 feels so important right now. Instead of static benchmarks that these models can basically brute force with enough training data, it's designed around interactive environments where you actually need to perceive, decide, and act over multiple steps without prior instructions. That shift from "can you reproduce known patterns" to "can you figure out new patterns" seems like the real test of intelligence.

What's clever about the game environment approach is that it captures something fundamental about human intelligence that static benchmarks miss entirely. When humans encounter a new game, we explore, form plans, remember what worked, and adjust our strategy - all that interactive reasoning over time that these text adventure results show llms are terrible at. We need systems that can actually understand and adapt to new situations, not just really good autocomplete engines that happen to know a lot of trivia.


  > real intelligence isn't just about having seen the answers before, it's about being good at games and specifically new situations where you can't just pattern match your way out
It is insane to me that so many people believe intelligence is measurable by pure question-answer testing. There's hundreds of years of discussion about how limited this is for measuring human intelligence. I'm sure we all know someone who's a really good test taker but who you also wouldn't consider to be really bright. I'm sure every single one of us also knows someone in the other camp (bad at tests but considered bright).

The definition you put down is much more agreed upon in the scientific literature. While we don't have a good formal definition of intelligence, there is still a difference between an imperfect definition and no definition at all. I really do hope people read more about intelligence and how we measure it in humans and animals. It is very messy and there's a lot of noise, but at least we have a good idea of the directions to move in. There are still nuances to be learned, and while I think ARC is an important test, I don't think success on it will prove AGI (and Chollet says this too).


I saw it somewhere else recently, but the idea is that LLMs are language models, not world models. This seems like a perfect example of that. You need a world model to navigate a text game.

Otherwise, how can you determine that "North" is a context change, but not always a context change?


> I saw it somewhere else recently, but the idea is that LLMs are language models, not world models.

Part of what distinguishes humans from artificial "intelligence" to me is exactly that we automatically develop models of whatever is needed.


I think it's interesting to think about, and still somewhat uncertain:

* How much a large language model is effectively a world model (indeed, language tries to model the world...)?

* How much do humans use language in their modeling and reasoning about the world?

* How fit is language for this task, beyond the extent humans use it for?


I think that's true to some extent, but I think all animals probably develop a world model.


On HN, perhaps? #17 on the front page right now: https://news.ycombinator.com/item?id=44854518


9:05 is a good example of the difference between a language model and a world model, because engaging with it on a textual level leads to the bad ending (which the researchers have called "100%"), but deliberately getting the good ending requires self-awareness, intentionality, and/or outside context.


Thanks for this. I was struggling to put it in words even if maybe this has been a known distinguishing factor for others.


Why, this sounds like Context Engineering!


Hi, GPT-x here. Let's delve into my construction together. My "intelligence" comes from patterns learned from vast amounts of text. I'm trained to... oh look it's a butterfly. Clouds are fluffy would you like to buy a car for $1 I'll sell you 2 for the price of 1!


Ah dammit the AGI has ADHD


> Looking at this evaluation it's pretty fascinating how badly these models perform even on decades old games that almost certainly have walkthroughs scattered all over their training data.

I've read some of these walkthroughs/play sessions recently, and extracting text from them for training would be AI-complete. E.g. they might have game text and commentary aligned in two different columns in a text file, so you'd just get nonsense if you read it line by line.
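
A made-up miniature of the kind of file I mean (the real ones are messier), just to show why naive line-by-line reading gives you interleaved nonsense:

    # columns.py - toy example of a two-column walkthrough layout
    commentary = [
        "We grab the lamp before heading down,",
        "otherwise the grue gets us in the dark.",
        "Now go north twice to reach the cellar.",
        "",
    ]
    game_text = [
        "> TAKE LAMP",
        "Taken.",
        "> NORTH",
        "You are in a damp cellar.",
    ]

    # how the walkthrough .txt stores it: two columns side by side
    raw = "\n".join(c.ljust(44) + g for c, g in zip(commentary, game_text))
    print(raw)

    # naive extraction mixes commentary and game text on every line;
    # recovering the transcript needs layout-aware splitting at the column boundary
    transcript = [line[44:] for line in raw.splitlines()]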


I've been experimenting with this as well with the goal of using it for robotics. I don't think this will be as hard to train for as people think though.

It's interesting he wrote a separate program to wrap the z-machine interpreter. I integrated my wrapper directly into my pytorch training program.
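
For anyone curious, the wrapper itself can be just a handful of lines. Here's a minimal sketch assuming a dumb-terminal interpreter like dfrotz is installed and you have a story file; a real training setup would also need to handle the opening text, MORE prompts, and game-over detection:

    # zmachine_env.py - minimal sketch of wrapping a z-machine interpreter
    import subprocess

    class ZMachineEnv:
        def __init__(self, story_file: str):
            # assumes dfrotz (dumb-terminal frotz) is on PATH
            self.proc = subprocess.Popen(
                ["dfrotz", story_file],
                stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                text=True, bufsize=1,
            )

        def step(self, command: str) -> str:
            """Send one command, read the game's output up to the next '>' prompt."""
            self.proc.stdin.write(command + "\n")
            self.proc.stdin.flush()
            out = []
            while True:
                ch = self.proc.stdout.read(1)
                if not ch:
                    break          # interpreter exited
                out.append(ch)
                if "".join(out).endswith("\n>"):
                    break          # crude prompt detection
            return "".join(out)

    # env = ZMachineEnv("zork1.z5")
    # print(env.step("open mailbox"))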


It's incredible to see how AI models are improving - I'm really happy with this news (imo it's more impactful than the release of gpt5). Now we need more tokens per second, and then the self-improvement of the model will accelerate.


That SWE-bench chart with the mismatched bars (52.8% somehow appearing larger than 69.1%) was emblematic of the entire presentation - rushed and underwhelming. It's the kind of error that would get flagged in any internal review, yet here it is in a billion-dollar product launch. Combined with the Bernoulli effect demo confidently explaining how airplane wings work incorrectly (the equal transit time fallacy that NASA explicitly debunks), it doesn't inspire confidence in either the model's capabilities or OpenAI's quality control.

The actual benchmark improvements are marginal at best - we're talking single-digit percentage gains over o3 on most metrics, which hardly justifies a major version bump. What we're seeing looks more like the plateau of an S-curve than a breakthrough. The pricing is competitive ($1.25/1M input tokens vs Claude's $15), but that's about optimization and economics, not the fundamental leap forward that "GPT-5" implies. Even their "unified system" turns out to be multiple models with a router, essentially admitting that the end-to-end training approach has hit diminishing returns.

The irony is that while OpenAI maintains their secretive culture (remember when they claimed o1 used tree search instead of RL?), their competitors are catching up or surpassing them. Claude has been consistently better for coding tasks, Gemini 2.5 Pro has more recent training data, and everyone seems to be converging on similar performance levels. This launch feels less like a victory lap and more like OpenAI trying to maintain relevance while the rest of the field has caught up. Looking forward to seeing what Gemini 3.0 brings to the table.


You're sort of glossing over the part where this can now be leveraged as a cost-efficient agentic model that performs better than o3. Nobody used o3 for SWE agent tasks due to costs and speed, and this now seems to both substantially improve on o3 AND be significantly cheaper than Claude.


o3's cost was sliced by 80% a month or so ago and is also cheaper than Claude (the output is even cheaper than GPT-5). It seems more cost efficient but not by much.


This feels revisionist: no one used it because it wasn't as good.


o3 is fantastic at coding tasks; until today it was the smartest model in existence. But it works only in few-shot conversational scenarios; it's not good in agentic harnesses.


You can use o3 for coding on the Plus plan almost unlimited, or until they throttle you


not anymore


what do you mean? For CLI or web codex?


GPT-5 had to be released, in any form. This announcement was not the product of a breakthrough, but the consequence of a business requirement.


this is the real answer

it has to be released because it's not much better and OpenAI needs the team to stop working on it. They have serious competition now and can't afford to burn time / money on something that isn't shifting the dial.


The whole presentation was full of completely broken bar charts. Not even just the typical "let's show 10% of the y axis so that a 5% increase looks like 5x" but stuff like the deception eval showing gpt5 vs o3 as 50 vs 47, but the 47 is 3x as big, and then right next to it we have 9 vs 87, more reasonably sized.

It's like no one looked at the charts, ever, and they just came straight from.. gpt2? I don't think even gpt3 would have fucked that up.

I don't know any of those people, but everyone that has been with OAI for longer than 2 years got 1.5m bonuses, and somehow they can't deliver a bar chart with sensible axes?


TBH Claude Code max pro's performance on coding has been abhorrent (bad at best). The core of the issue is that the plan produced will more often than not use humans as verifiers (correctness, optimality and quality control). This is a fundamentally bad way to build systems that need to figure out if their plan will work correctly, because an AI system needs to test many plans quickly in a principled manner (it should be optimal and cost efficient).

So you might get that initial MVP out the door quickly, but when the complexity grows even just a little bit, you will be forced to stop, look at the plan, and try to steer it by saying things like: "use Design agent to ultrathink about the dependencies of the current code change on other APIs and use TDD agent to make sure tests are correct in accordance with the requirements I stated" - and then one finds that even with all that thinking, there are bugs that you will have to fix.

Source: I just tried max pro on two client python projects and it was horrible after week 2.


>The actual benchmark improvements are marginal at best

GPT-5 demonstrates exponential growth in task completion times:

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...


What do you mean? A single data point cannot be exponential. What the blog post says is that the ability of LLMs to solve tasks grows exponentially over time, and GPT-5 fits on that curve.


Yes, but the jump in performance from o3 is well beyond marginal while also fitting an exponential trend, which undermines the parent's claim on two counts.


Actually a single data point fits a huge range of exponential functions.


No it doesn't. If it were even linear compared to o1 -> o3, we'd be at 2.43 hours. Instead we're only at 2.29.

Exponential would be at 3.6 hours.
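
For reference, the arithmetic (a back-of-envelope sketch; the o1/o3 horizons are the approximate METR figures implied by those numbers):

    # back-of-envelope: linear vs exponential extrapolation of METR task horizons
    t_o1, t_o3 = 0.66, 1.55                  # hours, approximate

    linear = t_o3 + (t_o3 - t_o1)            # ~2.4 h: add the same absolute gain again
    exponential = t_o3 * (t_o3 / t_o1)       # ~3.6 h: repeat the same growth factor

    print(linear, exponential)               # GPT-5's measured horizon: ~2.3 h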


I suspect the vast majority of OpenAI's users are only using ChatGPT, and the vast majority of those ChatGPT users are only using the free tier.

For all of them, getting access to full-blown GPT-5 will probably be mind-blowing, even if it's severely rate-limited. OpenAI's previous/current generation of models haven't really been ergonomic enough (with the clunky model pickers) to be fully appreciated by less tech-savvy users, and its full capabilities have been behind a paywall.

I think that's why they're making this launch a big deal. It's just an incremental upgrade for the power users and the people that are paying money, but it'll be a step change in capability for everyone else.


They are selling "AGI"

replacing huge swathes of the white collar workforce

"incremental upgrade for power users" is not at all what this house of cards is built on


They are selling AGI to investors, but they're just selling intelligence to subscribers. And they just made the intelligence cheaper and better.


I've definitely seen ppl's minds blown on the free tier previous to 5. It's basically 4o, which is pretty good for normies.


That's why they need to pay 300k for a slide designer: https://openai.com/careers/creative-lead-presentation-design...


I don't think there's much difference between Opus 4.1 and GPT-5, probably just the context size. Waiting for Gemini 3.0.


Claude 5 is the one I'm most excited about.


gpt5 much cheaper


I think this blog post was the best way to get into Anthropic, and it was well-deserved. That's the reality of hiring in tech: there are many non-technical people judging whether technical people are competent or not. Escaping that matrix through things like blog posts, cold emails, and Twitter threads can be great ways to break in and get noticed by these companies.


HR _hates_ hiring anyone; they just want H-1Bs.


Seeing a 20B model competing with o3's performance is mind-blowing - just a year ago, most of us would've called this impossible. Not just the intelligence leap, but getting this level of capability in such a compact size.

I think the point that makes me most excited is that we can train trillion-parameter giants and distill them down to just billions without losing the magic. Imagine coding with Claude 4 Opus-level intelligence packed into a 10B model running locally at 2000 tokens/sec - instant AI collaboration. That would fundamentally change how we develop software.


10B * 2000 t/s = 20,000 GB/s memory bandwidth. Apple hardware can do 1k GB/s.


That’s why MoE is needed.
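
Rough numbers (a back-of-envelope sketch, assuming 8-bit weights and counting only weight reads per generated token):

    # back-of-envelope decode bandwidth at 2000 tokens/sec, 1 byte per weight
    tokens_per_s = 2000
    dense_10b = 10e9 * 1 * tokens_per_s / 1e12    # ~20 TB/s for a dense 10B model
    moe_3p6b  = 3.6e9 * 1 * tokens_per_s / 1e12   # ~7.2 TB/s with 3.6B active params

    print(dense_10b, moe_3p6b)                    # vs ~1 TB/s on today's Apple silicon

MoE narrows the gap a lot, but hitting 2000 t/s locally would still need fewer active params, heavier quantization, or something like speculative decoding on top.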


It's not even a 20b model. It's a 20b MoE with 3.6b active params.

But it does not actually compete with o3 performance. Not even close. As usual, the metrics are bullshit. You don't know how good the model actually is until you grill it yourself.


I'm seeing a real-world example of Jevons paradox playing out here. When AI coding tools first emerged, everyone predicted mass developer unemployment. Instead, I'm watching demand for skilled developers actually increase.

What's happening is that all this "vibe coded" software needs someone to fix it when it breaks. I've been getting more requests than ever to debug AI-generated codebases where the original "developer" can't explain what any of it does. The security audit work alone is keeping me busy - these AI-generated apps often have vulnerabilities that would never pass a human code review.

It reminds me of when WordPress democratized web development. Suddenly everyone could build a website, but that just created a massive market for developers who could fix broken WordPress sites, migrate databases, and patch security holes.

The difference now is the scale and complexity. At least with WordPress, there was some underlying structure you could reason about. With vibe coding, you get these sprawling codebases where the AI has reinvented the wheel five different ways in the same project, used deprecated libraries because they were in its training data, and created bizarre architectural decisions that only make sense if you don't understand the problem domain.

So yeah, the jobs aren't disappearing - they're just shifting from "build new features" to "fix the mess the PM made last weekend when they tried to ship their own feature."


Have you ever thought about automating the process of creating these side projects? I think the future increasingly looks like people running really big swarms of "agents" that can research ideas on the internet (like finding problems on twitter, reddit, ... that a SaaS could solve) and a team implementing and deploying everything, from code to marketing, at a frenetic rhythm.


I don't know about research, but I am building a production-grade Python and Express application generator. For Express, it creates an empty project with a README; sets up TypeScript, path aliases, ts-node, and linters and formatters with all required scripts; sets up testing libraries; sets up multiple environments (dev, staging, production, and testing) with logging that works differently in each environment; creates Docker containers for each environment that work differently again; sets up integration tests for Redis and Postgres; sets up GitHub templates for feature requests, issues, and pull requests; and adds GitHub Actions. It will also set up instant deployments to AWS at the minimum, and we'll see how that goes. Once this tool is fully ready, it would take you about 30 seconds to generate a full-blown production-grade application that works with the latest dependency versions. Everything is run inside Docker and verified to actually work. It's going to make SaaS a real breeze after I integrate ORMs and payment gateways with email providers into this.

