More

dbingham · 2025-11-19T13:18:20 1763558300

> We can incorporate, but can we indemocrate?

I'm working on this right now. I'm building a Facebook alternative that will be a nonprofit, multistakeholder cooperative if I can get it off the ground. It won't be owned by anyone, instead it will be governed by its workers and users in collaboration.

It's called Communities (https://communities.social) and it's in open beta now. We got the apps in the app stores last month.

miki123211 · 2025-11-19T14:09:34 1763561374

How do you ensure "one user, one vote?"

This seems impossible without intrusive government ID verification, and not immune to government meddling in any case.

dbingham · 2025-11-19T14:39:34 1763563174

To be determined. It's not a problem we have yet, since we're going to ease into the cooperative governance.

Right now it's an LLC. If we can hit basic financial stability, then we'll convert the LLC to a nonprofit and start with an appointed board with a two year term who's job is to draft the permanent bylaws and define the electoral system. Basically, I'm bootstrapping it and we need to raise the money to pay the legal fees and fund the legal research needed to get the cooperative structure right. And part of that is going to be designing the electoral systems.

It's definitely going to be hard and it may end up coming down to "ID verification required to vote". Not to use the platform, just to vote in board elections. I'd love to find a way to avoid that, but we can always do it if we have to.

The plan is to moderate the platform pretty heavily using a two layered moderation system: community moderation as the first layer and official moderation as a second layer that moderates the community moderation. That moderation will be very much aimed at keeping the platform as free of bots, spammers, and propagandists as possible.

So if we're successful in that, we may be able to avoid the intrusive verification by saying "It's an honor system and all active users in good standing are trusted to be honorable." But it remains to be seen whether we're successful enough in the moderation to even attempt that.

Or we may be able to come up with some other system to ensure it.

The other piece is that it's a multi-stakeholder cooperative. Users elect half the board, but the workers elect the other half. And with workers, it will be easy to restrict it to one worker one vote. So the workers can and will provide a safety backstop against user elections that go off the rails in one way or another.

eucyclos · 2025-11-19T14:09:11 1763561351

Do you have a plan for dealing with the clickbait problem?

dbingham · 2025-11-19T14:14:36 1763561676

I'm exploring various systems of community moderation.

Right now experimenting with a "demote" button that people are encouraged to use on: disinformation, misinformation, propaganda, spam, and slop.

Communities' default feed is just chronological, but it also has "Most Active" and "Most Recent Activity". Right now, Demote knocks things down the "Most Active" feed.

Eventually, a high enough percentage of demotes would result in posts being removed from public feeds. A second, higher threshold, would result in it being removed from all feeds.

Demote usage would be moderated, and removal thresholds could be appealed to the official moderation team. Users who abuse or misuse demote would lose the privilege.

It's an experiment and we'll see if it works. It's also really early. But the thing that Communities is doing differently is that the users will ultimately be in control through democratic elections of the board. And I expect moderation to be a frequent and recurring issue in elections. (You know, if the whole thing gets off the ground at all.)

talkingtab · 2025-11-19T16:07:41 1763568461

When driving a car, I often wish I could tag other cars with arrows that mark them as bad drivers. These are the ones who weave in and out over 4 lanes on an interstate highway. When enough tags get the car, they have to pull over and take a break or something.

One immediate problem is malicious or over zealous taggers. But it seems easy to build a system that if you are too enthusiastic, then you have to pull over and take a break.

But accumulated reputation seems a thing. And if it is universal read and write it seems beneficial. It is somewhat like reputation in real life.

This depends somewhat on identities with some degree of stickiness. If you can just change who you are then bad reputation is not a big deal. But if there is some cost to establishing and maintaining an identity ...

eucyclos · 2025-11-21T02:29:55 1763692195

interesting. I have the theory that most networks, and most social networks, are doomed to tolerate too much or too little deviance - sailing between 4chan Scylla or reddit Charybdis. I hope you manage to thread the needle!

dbingham · 2025-11-07T01:20:59 1762478459

See, now this is an excellent use of LLMs (if we're going to be using them at all). Low stakes if it gets shit wrong, but can provide some really useful and surprising answers!

One request, it would be nice to not have to add Goodreads, since I don't use it. I've love to be able to enter a couple of book titles or an author and just get recommendations!

costco · 2025-11-07T01:40:23 1762479623

You don't have to import your Goodreads profile. You can type titles and authors in the box and find books to add to the list that way.

dbingham · 2025-09-10T15:01:55 1757516515

This all comes down to "We can't have nice things in America because of our toxic mix of individualism and capitalism."

Because we insist on trying to privatize everything, refuse to provide a safe floor for people, and make poverty and mental health challenges moral issues (meaning we degrade people who experience them and leave them to fend for themselves) we create an environment where true community is impossible.

Unless, of course, we apply authoritarian and abusive policing controls against those we've left behind, rounding them up and sending them somewhere else. Which of course achieves a temporary "peace" at the cost of a deep insecurity and fear, because we all know the moment we slip or step out of line, we're gone.

It really is toxic and has led directly to society breaking down to the point where we're now falling into full scale fascism.

rurban · 2025-09-11T06:08:26 1757570906

You can have nice things also. Eg the inner city of Park City, Utah is also car free, and busses are running for free in the winter season.

dbingham · 2025-09-05T13:01:12 1757077272

This is really unreasonable entitlement. Expecting perpetually free cloud services of any kind is wholely unreasonable. Clouds have monthly costs. The only reason companies like Apple can offer them is because they are very well capitalized. They offer them to addict you. Small companies and startups that don't have access to cheap capital cannot afford to do that, and it's much more honest for them to not do that!

noir_lord · 2025-09-06T17:08:50 1757178530

Do they sell one that is functional without the cloud though.

I’m not buying any device that requires a paid subscription to a cloud to get full functionality since if that goes away so does the ability to use the device.

A policy that’s served me well having friends IRL who keep getting bitten by services/IoT changing/going away/end of life.

Sell me a functional product with a subscription and I’m interested, everything else, no way.

dbingham · 2025-09-05T12:56:29 1757076989

I feel like the real opportunity for e-ink displays is white boards. I would love a 4 ft x 3 ft e-ink display that I could mount on the wall with calendar and note widgets and that I could also draw on like a white board.

Plus, remote teams would really benefit from a shared whiteboard device. A device with infinite scroll in both directions and shared editing. I mean, really adding infinite scroll and collaborative editing to existing remarkable devices would cover that use case.

dbingham · 2025-08-14T19:32:08 1755199928

It also matters whether we are considering it a static $10 million or considering reality.

In reality, if you have $10 million, you put it in the S&P500 and make an average of 10% ($1 million) per year. Far more than inflation and more than enough to cover those things you're talking about unless you have a pretty extreme medical condition or very expensive hobbies.

dkural · 2025-08-14T19:52:33 1755201153

I agree with this directionally, however I think you'll make more like 7.2% per year, and inflation will be about 2.5% per year. You'll also likely pay about 30% in federal and local taxes in the USA on it since you're actually selling it to live on it (more on taxes later). So you'll pay 2.2% in taxes. So on average you'll get 7.2 - (2.5 + 2.2) = 2.5% of income. If you have $10M, you can withdraw about 250K a year in today's dollars every year. i.e next year you can withdraw 256.3K or so, and keep doing this to keep your current standard of living. In down years you may want to adjust / tighten belt a tiny bit to not veer off track too much. And you can get cute with taxes but not recommended. That loan interest will add up over time, and when it's time to actually pay those loans, you'll still sell stock and pay taxes on it, unless your offspring inherit both.. and who knows what the laws will be then.

bakkoting · 2025-08-14T21:27:35 1755206855

The 7.2% number is already adjusted for inflation. Historically the stock market has gotten about 10% nominal return, 6.5-7% real.

dkural · 2025-08-18T03:16:23 1755486983

Huh! I genuinely didn't know that and this makes me very happy.

rurp · 2025-08-14T20:06:26 1755201986

Agreed, but would caveat that the historical market returns happened as the world's dominant economic and technical powerhouse. The current trajectory is looking different, to put it mildly. The US is undermining nearly every advantage that led to such strong growth. Barring some massive pivot in the near future, medium term economic growth will most likely be lower.

stripe_away · 2025-08-14T20:19:34 1755202774

inflation was double-digits in the 70s.

and the S&P was flat at 1.6% for the decade

despite some pretty amazing technical innovations pocket calculator and microcomputer (Altair 8800), first email, pong, floppy disks (they were the standard for 20 years), VCR, cell phone (1973 Motorola), barcode scanners, rubiks cube, ...

https://www.modwm.com/lost-decade-of-the-1970s/

NoLinkToMe · 2025-08-14T22:43:56 1755211436

> and the S&P was flat at 1.6% for the decade

Nah not really.

Nominally S&P500 did 23% in the 70s, and 2.08% annualised, but financial returns are not just the stock prices, they're also dividends.

If you include and reinvest dividends, you'd have made 83% in the decade and 6.2% per year.

Its true inflation was high though, and an investment in Jan 1970 would've in real terms returned -1.1% a year after adjusting for inflation. If you continued investing equal amounts each year from 1970 to 1980, it'd actually be about -0.5%.

But no investment would've meant you lost half of all your money due to 7% average inflation, so investing would've been a pretty good idea, offsetting almost all inflation in the worst decade 50 years ago.

Also it's common knowledge to do a stock/bond split. Bond returns fared a bit better. -- and it should be said, the following decade inflation came way down and in nominal terms the S&P500 did +364% with dividends reinvested.

I do agree with your general point though, you can't just rely on a 10% annual average and spend that amount. The commonly referenced safe withdrawal rate (WR) of 4% is 2.5x less than the average S&P500 return for a good reason (based on a ton of monte carlo sims that indeed would lead to disastrous results at 10% WR in the 1970s).

phkahler · 2025-08-14T19:48:14 1755200894

Except the market is a bubble. It's going to pop within 10 years as the boomers retire and die. Thats assuming low inflation. With significant inflation the younger folks might afford to prop it up.

jama211 · 2025-08-14T20:17:40 1755202660

Even if that’s the case, with 10 million you have 100 years of 100k+ a year even if you can only barely stave off the rate of inflation.

misja111 · 2025-08-15T06:54:44 1755240884

Can you elaborate? Why is the market going to pop "as the boomers retire and die"?

dbingham · 2025-08-13T15:51:19 1755100279

The question is, how do we enforce this?

dbingham · 2025-07-21T01:55:59 1753062959

The same, in theory, applies to social media. But they've all enshittified in very similar ways now that they've captured their audiences. In theory there is intense competition between Meta, Twitter, TikTok, etc, but in actuality the same market forces drive the same enshittification across all of those platforms. They have convergent interests. If they all force more ads and suggested posts on you, they all make more money and you have no where to go.

People are reasonably worried that the same will happen to AI.

senko · 2025-07-21T06:51:54 1753080714

> The same, in theory, applies to social media.

It absolutely does not.

Your use of social network derives value from your network. If you switch, you have to convince everyone else to switch as well.

It's a tremendous barrier to switching.

LLMs are for the most part interchangeable commodity.

smokel · 2025-07-21T13:22:37 1753104157

Note that the comment you are replying to is speculating about the (not so distant) future. Be assured that companies will try their best to lock customers in.

One option is to add adverts to the generated output, and making the product cheaper than the competition. Another is to have all your cloud data preprocessed with LLMs in a non-portable way, so that changing will incur a huge cost.

abid786 · 2025-07-23T01:44:08 1753235048

More and more of the social networks are just the algorithm though - tiktok, X, Facebook, etc. How much of your feed does the average use personally know now?

dbingham · 2025-07-11T18:10:57 1752257457

Please correct me if I'm wrong, but my understand of the alien craft theory specifically for Oumuamua wasn't just because the object itself was new, but that it changed acceleration [1] without apparent off gassing in a way that isn't explained by our current understanding of orbital physics for a natural object.

It's not just "New object, must be aliens!" It's "This thing doesn't fit our understanding of orbital motion for natural objects, aliens is actually a rational, if still unlikely, possible explanation."

[1] https://en.wikipedia.org/wiki/1I/%CA%BBOumuamua#Non-gravitat...

ryanblakeley · 2025-07-11T18:20:22 1752258022

There were a number of anomalous characteristics including its shape, acceleration, rotation, origin, and reflectivity.

ceejayoz · 2025-07-11T19:08:49 1752260929

How do we know they're anomalous characteristics if it's literally the first one we've ever spotted? What is the normal shape of an interstellar comet core?

cubefox · 2025-07-11T19:23:36 1752261816

For example, being flat like a pancake is obviously highly unusual and very different from anything we have seen from stellar comets.

ceejayoz · 2025-07-11T19:29:01 1752262141

Stellar comets haven't been ejected from another solar system. We have vanishingly few examples of those, and we've not directly observed any up close.

"Flat as a pancake" is one of several theoretical possibilities from its light curve, not a known fact about the object.

"Highly unusual" in space tends to mean "there are a bunch, but we haven't seen them until now". In 1992, exoplanets were "highly unusual". Now they're everywhere.

Sharlin · 2025-07-12T11:00:47 1752318047

Yes, and the exoplanets we found first were highly unusual and not at all what we expected to find, which triggered tons of new research to amend our models of planetary system formation and dynamics. I’m not even sure what you’re trying to argue here – we found an object that did not fit our model of what things should look like, which is very curious and calls for an explanation. That’s how science works. Doesn’t mean it’s aliens. But “oh well maybe it’s just how things are back where it’s from” does not satisfy anyone.

ceejayoz · 2025-07-12T13:34:38 1752327278

I think we are actually in agreement.

I’m very onboard with “it was an interesting object and we should learn more”.

I object to UFO cranks jumping to “it was a starship” conclusions like Avi Loeb wants to. Just as I would have when those weird first exoplanets showed up.

zikzak · 2025-07-12T23:36:16 1752363376

Did he conclude it was a starship or argue we shouldn't dismiss out of hand that an object like this has a non-zero chance of being an artifact of another civilization?

ceejayoz · 2025-07-13T00:21:48 1752366108

I mean, he titled his book about it “Extraterrestrial: The First Sign of Intelligent Life Beyond Earth”.

https://en.wikipedia.org/wiki/Extraterrestrial:_The_First_Si...

zikzak · 2025-07-13T01:08:22 1752368902

Did he say it was definitely a starship?

ceejayoz · 2025-07-13T02:25:19 1752373519

Will no one rid me of this turbulent priest?

cubefox · 2025-07-12T22:13:08 1752358388

I object to calling Avi Loeb a "crank" just because he thinks it might be an UFO.

ceejayoz · 2025-07-13T00:20:02 1752366002

I object to not calling him a crank.

The rest of the scientific community seems to take the same position.

cubefox · 2025-07-11T19:45:33 1752263133

The highly unusual properties are such that they are genuinely hard to explain for astronomers. See my neighbouring comment.

TheBlight · 2025-07-11T21:33:34 1752269614

The same as the ones from this system. Borisov had the same characteristics.

ceejayoz · 2025-07-11T21:39:15 1752269955

> The same as the ones from this system.

Why would we assume non-interstellar comets are always the same as interstellar comets? Conditions obviously are a little different when something is ejected from a system and then spends millions of years in interstellar space.

> Borisov had the same characteristics.

We have a sample size of three thus far. Making conclusions right now is like saying all extrasolar planets are large gas giants because the first three were.

TheBlight · 2025-07-11T23:29:19 1752276559

We'd assume most interstellar objects are comets because that's which objects you find on the outskirts of a solar system and are the easiest to get kicked out. We'd assume they're mostly like our comets due to the Copernican principle. We shouldn't assume we're special.

ceejayoz · 2025-07-12T00:30:07 1752280207

> We'd assume they're mostly like our comets due to the Copernican principle.

We're still figuring out what our comets are like, let alone unusual ones spending a few million years in interstellar space. New types of comets(ish) bodies discovered in the 2000s:

https://en.wikipedia.org/wiki/Active_asteroid

https://en.wikipedia.org/wiki/Manx_comet

We've spotted ~5k out of an estimated trillion. Each one we've sent a probe to has brought surprises. The Oort cloud remains theoretical at this time, and the first Kuiper belt object other than Pluto/Charon was found in 1992. It would be deeply silly to think we know everything about our local comets, let alone unusual ones from elsewhere.

jerf · 2025-07-11T20:07:30 1752264450

The history of science is that every freaking time we look somewhere new, we find something new. It happens over, and over, and over, and over again. We have a really bad track record of predicting things in advance in new domains. The exceptions are leaping to your mind precisely because you've heard about them because they're the exceptions.

Also, to date, zero of those things have been "aliens".

So rushing to declare the first instance of what was completely obviously a new class of objects as "aliens" because it didn't behave like what we expected is not rational, because we should expect that new things don't behave like we expect. The odds that the first one of these we detect is also the one from aliens is just not a good bet.

I'd bet a tidy sum of money that in 25 years it'll simply be common knowledge that these class of objects sometimes have those characteristics because of some characteristic special to them. Probably something to do with having a lot of things that turn to gasses and exert accelerations on the object because they were never blown off by the solar wind or something because of them being in deep space for millions of years. Might be most of them, might be a small-but-respectable fraction, but I bet in hindsight this is recorded in the history books right next to "pulsars are alien beacons!" and with the exact same tone of lightly sneering contempt we hold for that now. To which I can only say to the future, let the record show we did not all think it was aliens.

dbingham · 2025-07-13T22:32:16 1752445936

It is every bit as irrational to dismissively rule out "aliens" as it is to rush to declare it the definitive answer.

It is a valid possible answer that should be included in the possibilities as we try to figure out what caused the acceleration. Right now, lots of things are being proposed. And many of them are seemingly being ruled out. It remains to be seen what the answer will be.

The physics and history of science books I read when I got my degree did not seem to include sneering contempt for those who thought pulsars could be alien in origin. I rather recall a tone of disappointment as they described how we figured out that they weren't alien. The Fermi paradox remains one of the great mysteries of astronomy and cosmology and a lot of people, both professional and amateur are still fascinated by it.

Extraordinary claims do require extraordinary evidence, but academics who sneeringly dismiss extraordinary claims as even a possibility are every bit as toxic to the rational advancement of science as those who advance those claims without enough evidence.

mellosouls · 2025-07-11T22:31:28 1752273088

Yes (a change in acceleration was reported), but even in the link you yourself provide the hypotheses are framed within standard physics, not alien technology.

The latter got more than its fair share of press because Harvard's Avi Loeb proposed it as potential evidence of ET.

He later claimed more evidence from potential spaceship bits he reckons he found from an ancient meteor, and seems to specialize in these sorts of claims. [1]

Like you say, not irrational but perhaps over-hyped by people who ought to know better...

[1]https://www.independent.co.uk/news/world/americas/avi-loeb-i...

dbingham · 2025-05-23T02:14:34 1747966474

It feels like these new models are no longer making order of magnitude jumps, but are instead into the long tail of incremental improvements. It seems like we might be close to maxing out what the current iteration of LLMs can accomplish and we're into the diminishing returns phase.

If that's the case, then I have a bad feeling for the state of our industry. My experience with LLMs is that their code does _not_ cut it. The hallucinations are still a serious issue, and even when they aren't hallucinating they do not generate quality code. Their code is riddled with bugs, bad architectures, and poor decisions.

Writing good code with an LLM isn't any faster than writing good code without it, since the vast majority of an engineer's time isn't spent writing -- it's spent reading and thinking. You have to spend more or less the same amount of time with the LLM understanding the code, thinking about the problems, and verifying its work (and then reprompting or redoing its work) as you would just writing it yourself from the beginning (most of the time).

Which means that all these companies that are firing workers and demanding their remaining employees use LLMs to increase their productivity and throughput are going to find themselves in a few years with spaghettified, bug-riddled codebases that no one understands. And competitors who _didn't_ jump on the AI bandwagon, but instead kept grinding with a strong focus on quality will eat their lunches.

Of course, there could be an unforeseen new order of magnitude jump. There's always the chance of that and then my prediction would be invalid. But so far, what I see is a fast approaching plateau.

noduerme · 2025-05-23T02:41:45 1747968105

Wouldn't that be the best thing possible for our industry? Watching the bandwagoners and "vibe coders" get destroyed and come begging for actual thinking talent would be delicious. I think the bets are equal on whether later LLMs can unfuck current LLM code to the degree that no one needs to be re-hired... but my bet is on your side, that bad code collapses under its own weight. As does bad management in thrall to trends whose repercussions they don't understand. The scenario you're describing is almost too good. It would be a renaissance for the kind of thinking coders you're talking about - those of us who spend 90% of our time considering how to fit a solution to a domain and a specific problem - and it would scare the hell out of the next crop of corner suite assholes, essentially enshrining the belief that only smart humans can write code that performs on the threat/performance model needed to deal with any given problem.

>> the vast majority of an engineer's time isn't spent writing -- it's spent reading and thinking.

Unfortunately, this is now an extremely minority understanding of how we need to do our job - both among hirees and the people who hire them. You're lucky if you can find an employer who understands the value of it. But this is what makes a "10x coder". The unpaid time spent lying awake in bed, sleepless until you can untangle the real logic problems you'll have to turn into code the next day.

csomar · 2025-05-23T05:04:29 1747976669

That's not how real life works; you are thinking of a movie. Management will never let down of any power they accumulated until the place is completely ransacked. The Soviet Union is a cautionary tale, a relatively modern event and well documented.

noduerme · 2025-05-23T05:54:43 1747979683

I only work for companies where I have direct interaction with the owners. But I think that any business structure that begins to resemble a "soviet" type, where middle management accumulates all the power (and is scared of workers who have ideas) is inevitably going to collapse. If the way they try in the late 2020s to accumulate power is by replacing thoughtful coders with LLMs, they will collapse in a very dramatic, even catastrophic fashion. Which will be very funny to me. And it will result in their replacement, and the reinstatement of thoughtful code design.

A lot of garbage will have to be rewritten and a lot of poorly implemented logic re-thought. Again, I think a hard-learned lesson is in order, and it will be a great thing for our industry.

agoodusername63 · 2025-05-23T02:34:04 1747967644

I think theres still lots of room for huge jumps in many metrics. It feels like not too long ago that DeepSeek demonstrated that there was value in essentially recycling (Stealing, depending on your view) existing models into new ones to achieve 80% of what the industry had to offer for a fraction of the operating cost.

Researchers are still experimenting, I haven't given up hope yet that there will be multiple large discoveries that fundamentally change how we develop these LLMs.

I think I agree with the idea that current common strategies are beginning to scrape the bottom of the barrel though. We're starting to slow down a tad.

sponnath · 2025-05-24T01:44:45 1748051085

DeepSeek did more than just "recycle". Let's not downplay their achievement.

agoodusername63 · 2025-05-24T23:45:12 1748130312

Yeah I know but I feel like whenever I mention DeepSeek and don't say that I always get the most Anti-China bros in my face lol

adamtaylor_13 · 2025-05-23T03:09:08 1747969748

That’s funny, my experience has been the exact opposite.

Claude Code has single-handedly 2-3x my coding productivity. I haven’t even used Claude 4 yet so I’m pretty excited to try it out.

But even trusty ol 3.7 is easily helping me out out 2-3x the amount of code I was before. And before anyone asks, yes it’s all peer-reviewed and I read every single line.

It’s been an absolute game changer.

Also to your point about most engineering being thinking: I can test 4-5 ideas in the time it took me to test a single idea in the last. And once you find the right idea, it 100% codes faster than you do.

runekaagaard · 2025-05-23T05:42:01 1747978921

Yeah remember when people were using Claude 3.7... so oldschool man

icpmacdo · 2025-05-23T02:28:17 1747967297

"It feels like these new models are no longer making order of magnitude jumps, but are instead into the long tail of incremental improvements. It seems like we might be close to maxing out what the current iteration of LLMs can accomplish and we're into the diminishing returns phase."

SWE bench from ~30-40% to ~70-80% this year

elcritch · 2025-05-23T02:50:53 1747968653

Yet despite this all the LLMS I've tried struggle to scale beyond much more than a single module. They're vast improvements on that test perhaps, but in real life they still struggle to be coherent over larger projects and scales.

bckr · 2025-05-23T14:57:28 1748012248

> struggle to scale beyond much more than a single module

Yes. You must guide coding agents at the level of modules and above. In fact, you have to know good coding patterns and make these patterns explicit.

Claude 4 won’t use uv, pytest, pydantic, mypy, classes, small methods, and small files unless you tell it to.

Once you tell it to, it will do a fantastic job generating well-structured, type-checked Python.

viraptor · 2025-05-23T03:22:11 1747970531

Those are different kind of issues. Improving the quality of actions is what we're seeing here. Then for the larger projects/contexts the leaders will have to battle it out between the improved agents, or actually moving to something like RWKV and processing the whole project in one go.

morsecodist · 2025-05-23T03:35:00 1747971300

They may be different kinds of issues but they are the issues that actually matter.

avs733 · 2025-05-23T02:53:19 1747968799

3% to 40% is a 13x improvement

40% to 80% is a 2x improvement

It’s not that the second leap isn’t impressive, it just doesn’t change your perspective on reality in the same way.

viraptor · 2025-05-23T03:25:43 1747970743

Maybe... It will be interesting to see the improvements now compared to other benchmarks. Is 80->90% going to be an incremental fix with minimal impact on the next benchmark (same work but better), or is it going to be an overall 2x improvement on the remaining unsolved cases. (different approach tackling previously missed areas)

It really depends on how that remaining improvement happens. We'll see it soon though - every benchmark nearing 90% is being replaced with something new. SWE-verified is almost dead now.

energy123 · 2025-05-23T03:36:54 1747971414

80% to 100% would be an even smaller improvement but arguably the most impressive and useful (assuming the benchmark isn't in the training data)

andyferris · 2025-05-23T03:26:35 1747970795

I wouldn’t want to wait ages for Claude Code to fail 60% of the time.

A 20% risk seems more manageable, and the improvements speak to better code and problem solving skills around.

piperswe · 2025-05-23T02:29:23 1747967363

How much of that is because the models are optimizing specifically for SWE bench?

icpmacdo · 2025-05-23T02:32:26 1747967546

not that much because its getting better at all benchmarks

keeeba · 2025-05-23T06:26:26 1747981586

https://arxiv.org/abs/2309.08632

hodgehog11 · 2025-05-23T03:19:22 1747970362

Under what metrics are you judging these improvements? If you're talking about improving benchmark scores, as others have pointed out, those are increasing at a regular rate (putting aside the occasional questionable training practices where the benchmark is in the training set). But most individuals seem to be judging "order of magnitude jumps" in terms of whether the model can solve a very specific set of their use cases to a given level of satisfaction or not. This is a highly nonlinear metric, so changes will always appear to be incremental until suddenly it isn't. Judging progress in this way is alchemy, and leads only to hype cycles.

Every indication I've seen is that LLMs are continuing to improve, each fundamental limitation recognized is eventually overcome, and there are no meaningful signs of slowing down. Unlike prior statistical models which have fundamental limitations without solutions, I have not seen evidence to suggest that any particular programming task that can be achieved by humans cannot eventually be solvable by LLM variants. I'm not saying that they necessarily will be, of course, but I'd feel a lot more comfortable seeing evidence that they won't.

morsecodist · 2025-05-23T03:30:52 1747971052

I think it actually makes sense to trust your vibes more than benchmarks. The act of creating a benchmark is the hard part. If we had a perfect benchmark AI problems would be trivially solvable. Benchmarks are meaningless on their own, they are supposed to be a proxy for actual usefulness.

I'm not sure what is better than, can it do what I want? And for me the ratio of yes to no on that hasn't changed too much.

hodgehog11 · 2025-05-24T00:43:28 1748047408

I agree that this is a sensible judgement for practical use, but my point is that the vibes likely will change, it's just a matter of when. You can't draw a trendline on a nonlinear metric especially when you have no knowledge of the inflection point. Individual benchmarks are certainly fallible, and we always need better ones, but the aggregate of all of the benchmarks together (and other theoretical metrics not based on test data) is correlating reasonably well with opinion polling and these are all improving at a consistent rate. It's just that it's unclear when these model improvements will lead to the outcomes that you're looking for. When it happens, it will appear like a massive leap in performance, but really it's just a threshold being hit.

morsecodist · 2025-05-23T03:25:49 1747970749

I agree on the diminishing returns and that the code doesn't cut it on its own. I really haven't noticed a significant shift in quality in a while. I disagree on the productivity though.

Even for something like a script to do some quick debugging or answering a question it's been a huge boon to my productivity. It's made me more ambitious and take on projects I wouldn't have otherwise.

I also don't really believe that workers are currently being replaced by LLMs. I have yet to see a system that comes anywhere close to replacing a worker. I think these layoffs are part of a trend that started before the LLM hype and it's just a convenient narrative. I'm not saying that there will be no job loss as a result of LLMs I'm just not convinced it's happening now.

csomar · 2025-05-23T05:02:03 1747976523

> And competitors who _didn't_ jump on the AI bandwagon, but instead kept grinding with a strong focus on quality will eat their lunches.

If the banking industry is any clue they'll get bailout from the government to prevent a "systemic collapse". There is a reason "everyone" is doing it especially with these governments. You get to be cool, you don't risk of missing out and if it blows, you let it blow on the tax payer expense. The only real risk for this system is China because they can now out compete the US industries.

sublimefire · 2025-05-23T08:23:52 1747988632

There are a couple of things where LLMs are OK from the business perspective. Even if they are so so you can still write large amounts of mediocre code without the need to consume libraries. Think about GPL’d code, no need to worry about that because one dev can rewrite those libraries into proprietary versions without licensing constraints. Another thing is that LLMs are OK for an average company with few engineers that need to ship mountains of code across platforms, they would make mistakes anyway so LLMs should not make it worse.

Horffupolde · 2025-05-23T02:43:14 1747968194

So you abandon university because you don’t make order of magnitude progress between semesters. It’s only clear in hindsight. Progress is logarithmic.

Davidzheng · 2025-05-23T04:08:19 1747973299

Disagree. The marginal returns are more in places where the LLMs are near skill ceilings.