Yup. The problem was never the technology replacing work; it was always the social aspect of deploying it, which ends up pulling the rug out from under people whose livelihoods depend on exchanging labor for money.
The Luddites didn't destroy automatic looms because they hated technology; they did it because losing their jobs and seeing their whole occupation disappear ruined their lives and the lives of their families.
The problem to fix isn't automation itself; it's keeping automation from destroying people's lives at scale.
Influencers are, by definition, advertisers - and a particularly insidious, ugly bunch at that.
If we go by the vibe of this thread, it's yet another reason to avoid social media. You wouldn't want to reward people like this.
As for the broader topic, this segues into the worryingly popular fallacy of the excluded middle. Just because you're not against something doesn't mean you're supporting it. Being neutral, ambivalent, or just plain not giving a fuck about a whole class of issues is a perfectly legitimate place to be. In fact, that's everyone's default position on most things, because humans have limited mental capacity - we can't hold considered views on every single thing in the world all the time.
There isn't. There never was one, because the vast majority of websites are actually selfish with their data, even when that's entirely pointless. You can see it even here, in how some people complain that LLMs made them stop writing their blogs: it turns out plenty of people say they write for others to read, but they care more about tracking and controlling the audience.
Anyway, all that means there was never a critical mass of sites large enough for a default bulk-data-dump discovery mechanism to become established. Which means even the most well-intentioned scrapers cannot reliably determine whether such a mechanism exists, and have to scrape per page anyway.
> turns out plenty of people say they write for others to read
LLMs are not people. Bloggers don't write so that a company can profit from their writing by training LLMs on it; they write for others to read their ideas.
LLMs aren't making their owners money by just idling on datacenters' worth of GPUs. They're making money by being useful to users who pay for access. The knowledge and insights from the writing that goes into the training data end up being read by people directly, and also inform even more useful output and work, benefiting even more people.
Except the output coming from an LLM is the LLM's take on it, not the original source material. It's not the same thing. Not all writing is simply a collection of facts.
> turns out plenty of people say they write for others to read, but they care more about tracking and controlling the audience.
I couldn't care less about "tracking and controlling the audience," but I have no interest in others using my words and photos to profit from slop generators. I make that clear in robots.txt and licenses, but they ignore both.
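For reference, the kind of robots.txt rules I mean look roughly like this (the bot names are just examples of commonly documented AI crawler tokens; the actual list keeps changing, which is part of the problem):

  User-agent: GPTBot
  Disallow: /

  User-agent: CCBot
  Disallow: /

  User-agent: Google-Extended
  Disallow: /

  User-agent: *
  Allow: /

Per the robots.txt convention this only asks crawlers to stay away; nothing enforces it, which is exactly the complaint.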
Users are not being trained. Despite the seemingly dominant HN belief to the contrary, people use LLMs for interacting with information (on the web or otherwise) because they work. SOTA LLM services are just that good.
I'm not entirely sure why people think more standards are the way forward. The scrapers apparently don't listen to the already-established standards. What makes one think they would suddenly start if we add another one or two?
There is no standard, well-known way for a website to advertise, "hey, here's a cached data dump for bulk download, please use that instead of bulk scraping". If there were, I'd expect the major AI companies and other users[0] to use that method for gathering training data[1]. They have compelling reasons to: it's cheaper for them, and it cultivates goodwill instead of burning it.
This also means that right now it could be easier to push such a standard through than ever before: there are big players who would actually be receptive to it, so even a few not-entirely-selfish actors agreeing on it might just do the trick. (A rough sketch of what discovery could look like follows the footnotes.)
--
[0] - Plenty of them exist. Scraping wasn't popularized by AI companies; it's standard practice for online businesses in competitive markets. It's the digital equivalent of sending your employees to competing stores undercover.
[1] - Not to be confused with having an LLM scrape a specific page because a user requested it. That IMO is a totally legitimate and unfairly penalized/vilified use case, because the LLM is acting for the user - i.e. it becomes a literal user agent, in the same sense that a web browser is (this is the meaning behind the name of the "User-Agent" header).
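To make the idea concrete, here's a rough sketch of how a scraper could check for an advertised dump before falling back to per-page crawling. The /.well-known/bulk-data.json path and the manifest fields are invented for illustration - no such convention exists today, which is exactly the point:

  # Hypothetical sketch: prefer a site-provided bulk dump over per-page scraping.
  # The well-known path and manifest fields are made up; no such standard exists yet.
  import json
  import urllib.request

  def find_bulk_dump(site):
      """Return (dump_url, updated) if the site advertises a dump, else None."""
      manifest_url = f"https://{site}/.well-known/bulk-data.json"  # hypothetical path
      try:
          with urllib.request.urlopen(manifest_url, timeout=10) as resp:
              manifest = json.load(resp)
      except (OSError, ValueError):
          return None  # no manifest advertised: fall back to per-page crawling
      # Hypothetical manifest shape: {"dumps": [{"url": ..., "updated": ...}]}
      dumps = manifest.get("dumps") or []
      if not dumps:
          return None
      latest = max(dumps, key=lambda d: d.get("updated", ""))
      return latest["url"], latest.get("updated")

  if __name__ == "__main__":
      found = find_bulk_dump("example.com")
      if found:
          print("Fetch the advertised dump instead of crawling:", found[0])
      else:
          print("No dump advertised; per-page crawling is the only option.")

The point of the sketch is that discovery is cheap for the scraper: one request per site, and only sites that opt in pay any cost at all.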
You do realize that these AI scrapers are most likely written by people who have no idea what they're doing, right? Or who just don't care? If they did know or care, pretty much none of the problems these things have caused would exist. Even if we did standardize such a thing, I doubt they would follow it. After all, they act as if they and everyone else have infinite resources, so they can just hammer websites forever.
I realise you are making assertions for which you have no evidence. Until a standard exists we can't just assume nobody will use it, particularly when it makes the very task they are scraping for simpler and more efficient.
> I realise you are making assertions for which you have no evidence.
We do have evidence: their current behavior. If they are happy to ignore robots.txt (and copyright law, for that matter), what gives you the belief that they magically won't ignore this new standard? Sure, in theory it might save them money, but if one thing is blatantly obvious, it's that money isn't what these companies care about, because people just keep turning the money tap back on. If they did care, they wouldn't be spending far more than they earn, and they wouldn't be creating circular economies to try to justify their existence. If my assertion has no evidence, I don't see how yours does either, especially since we have seen that these companies will do anything if it means getting what they want.
Reality doesn't have a distinction between "code" and "data"; those are categories of convenience, and they don't even have a proper definition (what counts as code and what counts as data depends on who's asking and why). Any such distinction has to be mechanically enforced; AI won't have it, because it's not natural, and adding it destroys the generality of the model.
Haha. But DNA is a very good example of what I'm talking about. It's both "code" and "data" at the same time - or rather, a perfect demonstration that these concepts don't exist in nature.
I get the joke, but it's also an incredibly interesting topic to ponder. Remember "Reflections on Trusting Trust"? Now consider that DNA itself needs a complex biomolecular machine to "compile" it into cells and organisms, and that this also embeds copies of the "compiler" in them. This raises the question of how much of the information needed to build the organism is not explicitly encoded anywhere in the DNA itself, but instead accumulates in the replication mechanism and gets carried over implicitly.
So for you to successfully use my DNA as code, without also borrowing the compiler from my body, would be a major scientific result, shining light on the questions outlined above.
So in short: I'm happy to contribute my DNA if you cite me as co-author on the resulting paper :P.
> The problem is, so much of what people want from these things involves having all three.
Pretty much. Also there's no way of "securing" LLMs without destroying the quality that makes them interesting and useful in the first place.
I'm putting "securing" in scare quotes because IMO it's a fool's errand to even try - LLMs are fundamentally not securable like regular, narrow-purpose software, and should not be treated as such.
> Employees are under contract and are screened for basic competence. LLMs aren't
So perhaps they should be.
> and can't be.
Ah but they must, because there's not much else you can do.
You can't secure LLMs as if they were just regular, narrow-purpose software, because they aren't. They're by nature more like little people on a chip (this is an explicit design goal) - and need to be treated accordingly.
Unless both the legalities and the technology radically change, they will not be. And the companies building them will not take on that burden, since the technology has proved to be so unpredictable (partially by design) and unsafe.
> designed to be more like little people on a chip - and need to be treated accordingly
Deeply unpredictable and unsafe people on a chip, so not the sort that I generally want to trust secrets with.
I don't think it's that complex: you can have secure systems or you can have current-gen LLMs. You can't have both in the same place.
> Deeply unpredictable and unsafe people on a chip, so not the sort that I generally want to trust secrets with.
Very true when compared to acquaintances, but at the scale of any company or system except the tiniest ones, you can't blindly trust people in general either. Building systems around LLMs is pretty similar to building systems around people.
> I don't think it's that complex, you can have secure systems or you can have current gen LLMs. You can't have both in the same place.
That is, indeed, the key. My point is that, contrary to the popular opinion in threads like this, it does not follow that we need to give up on LLMs, or that we need to fix the security issues. The former is undesirable; the latter is fundamentally impossible.
What we need is what we've been doing ever since civilization took shape, ever since we started building machines: recognize that automatons and people are different kinds of components, with different reliability and security characteristics. You can't blindly substitute one for the other, but there are ways to make them work together. Most systems we've created are of that nature.
What people still get wrong is treating LLMs as "automaton" components. They're not; they're "people" components.
I think I generally agree, but I also think that treating them like people means that you expect reason, intelligence and a way to interrogate their way of "thinking" (very broad quotes here).
I think LLMs are to be treated as something completely separate from both predictable machines ("automatons") and people. They have different concerns, and different fitness for a given use case, than either existing category.
Sooo the primary ways we enforce contracts and laws against people are things like fines and jail time.
How would you apply the threat of those to "little people on a chip", exactly?
Imagine if, any time you hired someone, there was a risk that they'd try to steal everything they could from your company and then disappear forever, leaving you with no way to hold them to account. You'd probably stop hiring people you didn't already deeply trust!
Strict liability for LLM service providers? Well, that's gonna be a non-starter unless there are a lot of MAJOR issues caused by LLMs (look at how little we care about identity theft and financial fraud currently).