This is easy to say until you're an immigrant worker in a foreign country - something you probably worked toward your entire life up to that point - risking it all (and potentially wrecking your entire family's life) just to stop some random utility from having a Copilot button. It's not "this software will be used to kill people"; it's more like "there's this extra toolbar which nobody uses".
I hadn't made more solid connections between the current state of software and industry, the subjugation of immigrants, and the death of the American neoliberal order until this comment thread, but here it lies bare, naked, and essentially impossible to ignore. With regards to the whole picture, there's no good or moral place to "RETVRN" to in a nostalgic sense. The one question that keeps ringing through my head, as I watch the world in constant upheaval and my one refuge of meaning, technical craftsmanship, tumbling, is: why did I not see this coming?
Because society in the US is arranged as a competition with no safety net, where your employer has a disproportionate amount of influence on your well-being and the happiness of your kids.
I'm not going to give up $1M in total comp and excellent insurance for my family because you and I don't like where AI is going.
Just having the option of giving up $1 million in compensation puts you far, far above any meaningful worries about your well-being and the happiness of your kids.
I'll have to explain it to the wife: "well, you see, we can't live in this house anymore because AI in Notepad was just too much".
I'll dial my ethical and moral stance on software up to 11 when I see a proper social safety net in this country, with free healthcare and free education.
And if we can't all agree on having even those vital things for free, then relying on collective agreement on software issues will never work in practice, so my sacrifice would be for nothing. I would just end up being the dumb idealist.
I'm in a similar situation, but for a more ridiculous reason. My reasonably modern hardware should support Windows 11, but I get "disk not supported" because apparently I once picked the "wrong" bootloader?
I can't be arsed; if I'm going to have to fiddle around getting that working, I might as well move to Linux.
Much greater than now, given the open discoverability of the original post here, versus the walled-off content we have today, locked away in Discord servers and the like.
Furthermore, the act of replying to that post will have bumped it right back to the top for everyone to see.
I agree with this. We sorely miss these forums with civil replies, now overshadowed by "influencer" culture, which is optimized for engagement incentives. Pure discussions like this example are such stalwarts of the open web.
On the other hand, small websites and forums can disappear, but that openness allows platforms like archive.org to capture and "fossilize" them.
These forums still exist, typically with much older and more mature discussions, as the users have aged alongside the forums. Nothing is stopping you from joining them now.
My Something Awful forums account is over 25 years old at this point. The software, standards, and moderation style are approximately unchanged, complete with the 10-dollar sign-up fee to keep out the spam.
A model or new model version X is released, everyone is really impressed.
3 months later, "Did they nerf X?"
It's been this way since the original ChatGPT release.
The answer is typically no; it's just that your expectations have risen. What was previously a mind-blowing improvement is now expected, and any missteps feel amplified.
This is not always true. LLMs do get nerfed, and quite regularly, usually because the provider discovers that users are using them more than expected, because of user abuse, or simply because the model attracts a larger user base. One recent nerf is the Gemini context window, which was drastically reduced.
What we need is an open and independent way of testing LLMs, and stricter regulation requiring disclosure of product changes when the product is paid for under a subscription or prepaid plan.
Unfortunately, it has paywalled most of the historical data since I last looked at it, but it's interesting that Opus has dipped below Sonnet on overall performance.
Interesting! I was just thinking about pinging the creator of simple-bench.com and asking them if they intend to re-benchmark models after 3 months. I've noticed, in particular, Gemini models dramatically declining in quality after the initial hype cycle. Gemini 3 Pro _was_ my top performer and has slowly declined to 'is it worth asking', complete with gpt-4o style glazing. It's been frustrating. I had been working on a very custom benchmark, and over the course of it Gemini 3 Pro and Flash both started underperforming by 20% or more. I wondered if I had subtly broken my benchmark but ultimately started seeing the same behavior in general online queries (Google AI Studio).
> What we need is an open and independent way of testing LLMs
I mean, that's part of the problem: as far as I know, no claim of "this model has gotten worse since release!" has ever been validated by benchmarks. Obviously benchmarking models is an extremely hard problem, and you can try and make the case that the regressions aren't being captured by the benchmarks somehow, but until we have a repeatable benchmark which shows the regression, none of these companies are going to give you a refund based on your vibes.
We've got a lot of available benchmarks & modifying at least some of those benchmarks doesn't seem particularly difficult: https://arc.markbarney.net/re-arc
To reduce cost & maintain credibility, we could have the benchmarks run through a public CI system.
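A minimal sketch of what I mean (hedged: query_model, tasks.json, and results.jsonl are made-up names, not any existing harness) - keep a fixed, version-controlled task set, re-run it on a schedule in a public CI job, and append the dated scores where anyone can audit them:

    import json, datetime

    # Fixed, version-controlled prompts with expected answers.
    TASKS = json.load(open("tasks.json"))

    def query_model(prompt: str) -> str:
        # Hypothetical stand-in for whichever API client you actually use.
        raise NotImplementedError

    def run_once() -> dict:
        correct = sum(query_model(t["prompt"]).strip() == t["expected"] for t in TASKS)
        return {"date": datetime.date.today().isoformat(), "score": correct / len(TASKS)}

    if __name__ == "__main__":
        # Append so the whole score history stays in the repo for anyone to inspect.
        with open("results.jsonl", "a") as f:
            f.write(json.dumps(run_once()) + "\n")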
I usually agree with this. But the same workflows and skills that were a breeze for Claude are now causing it to run in cycles and require intervention.
This is not the same thing as an "omg, the vibes are off"; it's reproducible. I am using the same prompts and files, and getting way worse results than I do with any other model.
Also, people who were lucky and had lots of success early on, but then started running into the actual problems of LLMs, will experience that as "it was good and then it got worse" even when it didn't actually get worse.
If LLMs have a 90% chance of working, there will be some who have only success and some who have only failure.
People are really failing to understand the probabilistic nature of all of this.
"You have a radically different experience with the same model" is perfectly possible with less than hundreds of thousands of interactions, even when you both interact in comparable ways.
Opus was, is, and for the foreseeable future will be a non-deterministic probability machine. The variance eventually shows up when you push it hard.
Eh, I've definitely had issues where Claude can no longer easily do what it's previously done. That's with constantly documenting things well in appropriate markdown files and resetting context here and there to keep confusion minimal.
Indeed, increasing the incentive for companies to reject (and then sometimes silently fix anyway) even the valid reports would only create further misery for everyone.