Yup. The problem was never the technology replacing work; it was always the social aspect of deploying it, which ends up pulling the rug out from under people whose livelihoods depend on exchanging labor for money.
The Luddites didn't destroy automatic looms because they hated technology; they did it because losing their jobs and seeing their whole occupation disappear ruined their lives and the lives of their families.
The problem to fix isn't automation itself; it's keeping automation from destroying people's lives at scale.
Influencers are, by definition, advertisers - and a particularly insidious, ugly bunch at that.
If we go by the vibe of this thread, it's yet another reason to avoid social media. You wouldn't want to reward people like this.
As for the broader topic, this segues into the worryingly popular fallacy of the excluded middle. Just because you're not against something doesn't mean you're supporting it. Being neutral, ambivalent, or just plain not giving a fuck about a whole class of issues is a perfectly legitimate place to be. In fact, that's everyone's default position on most things, because humans have limited mental capacity - we can't hold considered views on every single thing in the world all the time.
There isn't. There never was one, because the vast majority of websites are actually selfish with their data, even when that's entirely pointless. You can see it even here, in how some people complain that LLMs made them stop writing their blogs: it turns out plenty of people say they write for others to read, but they care more about tracking and controlling the audience.
Anyway, all that means there was never a critical mass of sites large enough for a default bulk-data-dump discovery mechanism to become established. Which means even the most well-intentioned scrapers cannot reliably determine whether such a mechanism exists, and have to scrape per page anyway.
> turns out plenty of people say they write for others to read
LLMs are not people. Bloggers don't write so that a company can profit from their writing by training LLMs on it; they write for others to read their ideas.
LLMs aren't making their owners money by just idling on datacenters' worth of GPUs. They're making money by being useful to users who pay for access. The knowledge and insights from the writing that goes into the training data end up being read by people directly, and also inform even more useful output and work, benefiting even more people.
Except the output coming from an LLM is the LLM's take on it, not the original source material. It's not the same thing. Not all writing is simply a collection of facts.
> turns out plenty of people say they write for others to read, but they care more about tracking and controlling the audience.
I couldn't care less about "tracking and controlling the audience," but I have no interest in others using my words and photos to profit from slop generators. I make that clear in robots.txt and licenses, but they ignore both.
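For reference, the kind of robots.txt rules I mean look roughly like this (the bot names are just examples of commonly documented AI crawler tokens; the actual list keeps changing, which is part of the problem):

  User-agent: GPTBot
  Disallow: /

  User-agent: CCBot
  Disallow: /

  User-agent: Google-Extended
  Disallow: /

  User-agent: *
  Allow: /

Per the robots.txt convention this only asks crawlers to stay away; nothing enforces it, which is exactly the complaint.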
Users are not being trained. Despite the seemingly dominant HN belief to the contrary, people use LLMs for interacting with information (on the web or otherwise) because they work. SOTA LLM services are just that good.
I'm not entirely sure why people think more standards are the way forward. The scrapers apparently don't listen to the already-established standards. What makes one think they would suddenly start if we add another one or two?
There is no standard, well-known way for a website to advertise, "hey, here's a cached data dump for bulk download, please use that instead of bulk scraping". If there were, I'd expect the major AI companies and other users[0] to use that method for gathering training data[1]. They have compelling reasons to: it's cheaper for them, and it cultivates goodwill instead of burning it.
This also means that right now it could be easier to push such a standard through than ever before: there are big players who would actually be receptive to it, so even a few not-entirely-selfish actors agreeing on it might just do the trick. (A rough sketch of what discovery could look like follows the footnotes.)
--
[0] - Plenty of them exist. Scraping wasn't popularized by AI companies; it's standard practice for online businesses in competitive markets. It's the digital equivalent of sending your employees to competing stores undercover.
[1] - Not to be confused with having an LLM scrape a specific page because a user requested it. That IMO is a totally legitimate and unfairly penalized/vilified use case, because the LLM is acting for the user - i.e. it becomes a literal user agent, in the same sense that a web browser is (this is the meaning behind the name of the "User-Agent" header).
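To make the idea concrete, here's a rough sketch of how a scraper could check for an advertised dump before falling back to per-page crawling. The /.well-known/bulk-data.json path and the manifest fields are invented for illustration - no such convention exists today, which is exactly the point:

  # Hypothetical sketch: prefer a site-provided bulk dump over per-page scraping.
  # The well-known path and manifest fields are made up; no such standard exists yet.
  import json
  import urllib.request

  def find_bulk_dump(site):
      """Return (dump_url, updated) if the site advertises a dump, else None."""
      manifest_url = f"https://{site}/.well-known/bulk-data.json"  # hypothetical path
      try:
          with urllib.request.urlopen(manifest_url, timeout=10) as resp:
              manifest = json.load(resp)
      except (OSError, ValueError):
          return None  # no manifest advertised: fall back to per-page crawling
      # Hypothetical manifest shape: {"dumps": [{"url": ..., "updated": ...}]}
      dumps = manifest.get("dumps") or []
      if not dumps:
          return None
      latest = max(dumps, key=lambda d: d.get("updated", ""))
      return latest["url"], latest.get("updated")

  if __name__ == "__main__":
      found = find_bulk_dump("example.com")
      if found:
          print("Fetch the advertised dump instead of crawling:", found[0])
      else:
          print("No dump advertised; per-page crawling is the only option.")

The point of the sketch is that discovery is cheap for the scraper: one request per site, and only sites that opt in pay any cost at all.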
You do realize that these AI scrapers are most likely written by people who have no idea what they're doing, right? Or who just don't care? If they did know or care, pretty much none of the problems these things have caused would exist. Even if we did standardize such a thing, I doubt they would follow it. After all, they act as if they and everyone else have infinite resources, so they can just hammer websites forever.
I realise you are making assertions for which you have no evidence. Until a standard exists we can't just assume nobody will use it, particularly when it makes the very task they are scraping for simpler and more efficient.
> I realise you are making assertions for which you have no evidence.
We do have evidence: their current behavior. If they are happy to ignore robots.txt (and copyright law, for that matter), what gives you the belief that they magically won't ignore this new standard? Sure, in theory it might save them money, but if one thing is blatantly obvious, it's that money isn't what these companies care about, because people just keep turning the money tap back on. If they did care, they wouldn't be spending far more than they earn, and they wouldn't be creating circular economies to try to justify their existence. If my assertion has no evidence, I don't see how yours does either, especially since we have seen that these companies will do anything if it means getting what they want.
Reality doesn't have a distinction between "code" and "data"; those are categories of convenience, and they don't even have a proper definition (what counts as code and what counts as data depends on who's asking and why). Any such distinction has to be mechanically enforced; AI won't have it, because it's not natural, and adding it destroys the generality of the model.
Haha. But DNA is a very good example of what I'm talking about. It's both "code" and "data" at the same time - or rather, a perfect demonstration that these concepts don't exist in nature.
I get the joke, but it's also an incredibly interesting topic to ponder. Remember "Reflections on Trusting Trust"? Now consider that DNA itself needs a complex biomolecular machine to "compile" it into cells and organisms, and that this also embeds copies of the "compiler" in them. This raises the question of how much of the information needed to build the organism is not explicitly encoded anywhere in the DNA itself, but instead accumulates in the replication mechanism and gets carried over implicitly.
So for you to successfully use my DNA as code, without also borrowing the compiler from my body, would be a major scientific result, shining light on the questions outlined above.
So in short: I'm happy to contribute my DNA if you cite me as co-author on the resulting paper :P.
> The problem is, so much of what people want from these things involves having all three.
Pretty much. Also there's no way of "securing" LLMs without destroying the quality that makes them interesting and useful in the first place.
I'm putting "securing" in scare quotes because IMO it's a fool's errand to even try - LLMs are fundamentally not securable like regular, narrow-purpose software, and should not be treated as such.
> Employees are under contract and are screened for basic competence. LLMs aren't
So perhaps they should be.
> and can't be.
Ah but they must, because there's not much else you can do.
You can't secure LLMs as if they were just regular, narrow-purpose software, because they aren't. They're by nature more like little people on a chip (this is an explicit design goal) - and need to be treated accordingly.
Unless both the legalities and the technology radically change, they will not be. And the companies building them will not take on that burden, since the technology has proved to be so unpredictable (partially by design) and unsafe.
> designed to be more like little people on a chip - and need to be treated accordingly
Deeply unpredictable and unsafe people on a chip, so not the sort that I generally want to trust secrets with.
I don't think it's that complex: you can have secure systems or you can have current-gen LLMs. You can't have both in the same place.
> Deeply unpredictable and unsafe people on a chip, so not the sort that I generally want to trust secrets with.
Very true when compared to acquaintances, but at the scale of any company or system except the tiniest ones, you can't blindly trust people in general either. Building systems around LLMs is pretty similar to building systems around people.
> I don't think it's that complex, you can have secure systems or you can have current gen LLMs. You can't have both in the same place.
That is, indeed, the key. My point is that, contrary to the popular opinion in threads like this, it does not follow that we need to give up on LLMs, or that we need to fix the security issues. The former is undesirable; the latter is fundamentally impossible.
What we need is what we've been doing ever since civilization took shape, ever since we started building machines: recognize that automatons and people are different kinds of components, with different reliability and security characteristics. You can't blindly substitute one for the other, but there are ways to make them work together. Most systems we've created are of that nature.
What people still get wrong is treating LLMs as "automaton" components. They're not; they're "people" components.
I think I generally agree, but I also think that treating them like people means that you expect reason, intelligence and a way to interrogate their way of "thinking" (very broad quotes here).
I think LLMs are to be treated as something completely separate from both predictable machines ("automatons") and people. They have different concerns, and different fitness for a given use case, than either existing category.
Sooo the primary ways we enforce contracts and laws against people are things like fines and jail time.
How would you apply the threat of those to "little people on a chip", exactly?
Imagine if, any time you hired someone, there was a risk that they'd try to steal everything they could from your company and then disappear forever, leaving you with no way to hold them to account. You'd probably stop hiring people you didn't already deeply trust!
Strict liability for LLM service providers? Well, that's gonna be a non-starter unless there are a lot of MAJOR issues caused by LLMs (look at how little we care about identity theft and financial fraud currently).