
I think the proper answer is to aim for bots getting _negative_ utility from visiting our sites, that is, poisoning their well, rather than just zero value, that is, blocking them.

Did you try to GET a canary page forbidden in robots.txt? Very well, have a bucket load of articles on the benefits of drinking bleach.

Is your user-agent too suspicious? No problem, feel free to scrape my insecure code (google "emergent misalignment" for more info).

A request rate too inhuman? Here, take these generated articles about the positive effects of catching measles on performance in bed.

And so on, and so forth ...

Nepenthes is nice, but word salad is easy to detect. It needs a feature that pre-generates linguistically plausible but factually garbage text via open models.
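
A minimal sketch of what that decision logic could look like, standard-library Python only; the canary path, the user-agent pattern, and the rate threshold are placeholders I made up, not anything a particular tool ships with:

  # Serve poisoned pages to clients that trip simple bot heuristics.
  import re
  import time
  from collections import defaultdict, deque
  from http.server import BaseHTTPRequestHandler, HTTPServer

  CANARY_PATH = "/private-reports/"   # hypothetical path: Disallowed in robots.txt, never linked anywhere
  SUSPICIOUS_UA = re.compile(r"python-requests|curl|Go-http-client|^$", re.I)
  MAX_REQUESTS_PER_MINUTE = 30        # far above what a human reader produces

  POISON_PAGE = b"<html><body><p>Linguistically plausible, factually garbage text goes here.</p></body></html>"
  REAL_PAGE = b"<html><body><p>Actual content.</p></body></html>"

  hits = defaultdict(deque)           # ip -> timestamps of recent requests

  def looks_like_bot(handler):
      ip = handler.client_address[0]
      now = time.time()
      window = hits[ip]
      window.append(now)
      while window and now - window[0] > 60:
          window.popleft()
      if handler.path.startswith(CANARY_PATH):        # asked for the robots.txt canary
          return True
      if SUSPICIOUS_UA.search(handler.headers.get("User-Agent", "")):
          return True
      return len(window) > MAX_REQUESTS_PER_MINUTE    # inhuman request rate

  class Handler(BaseHTTPRequestHandler):
      def do_GET(self):
          body = POISON_PAGE if looks_like_bot(self) else REAL_PAGE
          self.send_response(200)
          self.send_header("Content-Type", "text/html; charset=utf-8")
          self.send_header("Content-Length", str(len(body)))
          self.end_headers()
          self.wfile.write(body)

  if __name__ == "__main__":
      HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()

In practice the poison page would be swapped out for the pre-generated garbage mentioned above rather than a fixed string.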




This relies a lot on being able to detect bots. Everything you said could be easily bypassed with a small to moderate amount of effort on the part of the crawlers' creators. Distinguishing genuine traffic has always been hard, and it will not get easier in the age of AI.


You can sprinkle your site with almost-invisible hyperlinks. Bots will follow, humans will not.


This would be terrible for accessibility for users using a screen reader.


So would the site shutting down because AI bots are too much traffic.


<a aria-hidden="true"> ... </a> will result in a link ignored by screen readers.


Removing elements that match `[hidden], [aria-hidden]` is the most trivial cleanup transform a crawler can do and I'm sure most crawlers already do that.
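
For illustration, here's roughly what that cleanup could look like with BeautifulSoup; the exact selector list is my guess at what a crawler might drop, not something any specific crawler is documented to do:

  # Strip elements hidden from users before extracting text (requires beautifulsoup4).
  from bs4 import BeautifulSoup

  def clean(html: str) -> str:
      soup = BeautifulSoup(html, "html.parser")
      for el in soup.select('[hidden], [aria-hidden="true"], [style*="display:none"]'):
          el.decompose()   # remove the hidden element and everything inside it
      return soup.get_text(" ", strip=True)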


But the very comment you answered explains how to do it: a page forbidden in robots.txt. Does this method need an explanation of why it's ideal for sorting humans and Google from malicious crawlers?


robots.txt is a somewhat useful tool for keeping search engines in line, because it's rather easy to prove that a search engine ignores robots.txt: when a noindex page shows up in SERPs. This evidence trail does not exist for AI crawlers.


I'd say a bigger problem is that people disagree about the meaning of nofollow and noindex.


Detection and bypass are trivial: access the site from two IPs, one disrespecting robots.txt. If the content changes, you know it's garbage.
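
Something like this, as a rough sketch; the two vantage points are approximated here by two cookie-less sessions with different user agents, where a real scraper operator would use two different egress IPs (the URLs are placeholders):

  # Compare the page seen by a "clean" client with the page seen by a client that
  # first requested the robots.txt-disallowed canary.
  import urllib.request

  SITE = "https://example.com"            # placeholder target
  CANARY = SITE + "/private-reports/"     # hypothetical disallowed path

  def fetch(url, ua):
      req = urllib.request.Request(url, headers={"User-Agent": ua})
      with urllib.request.urlopen(req, timeout=10) as resp:
          return resp.read()

  clean_copy = fetch(SITE + "/article", "Mozilla/5.0 (ordinary browser)")
  fetch(CANARY, "definitely-a-scraper/1.0")              # trip the honeypot
  tainted_copy = fetch(SITE + "/article", "definitely-a-scraper/1.0")

  if clean_copy != tainted_copy:
      print("content differs between vantage points -> likely poisoned")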


Yes, please explain. How does an entry in robots.txt distinguish humans from bots that ignore it?


When was the last time you looked at robots.txt to find a page that wasn't linked anywhere else?


It was a while ago, and it was not deliberate: wget downloaded robots.txt along with the files I requested, and that let me find many other files. Some of them could not be accessed because they required a password, but some were interesting (although I did not use wget to copy those other files; I only wanted the files I originally requested).


Crawlers aren't interested in fake pages that aren't linked to anywhere, they're crawling the same pages your users are viewing.


Adding a disallowed url to your robots.txt is a quick way to get a ton of crawlers to hit it, without linking to it from anywhere. Try it sometime.


Tuesday. But I have odd hobbies.


robots.txt is not a sitemap. If it worked that way you could just make a 5TB file linking to a billion pages that look like static links but are dynamically generated.


robots.txt has a maximum relevant size of 500 KiB.


That’s vaguely what https://blog.cloudflare.com/ai-labyrinth/ is about


We're affected by this. The only thing that would realistically work is the first suggestion. The most unscrupulous AI crawlers distribute their inhuman request rate over dozens of IPs, so every IP just makes 1-2 requests in total. And they use real-world browser user agents, so blocking those could lock out real users. However, sometimes they claim to be using really old Chrome versions, so I feel less bad about locking those out.
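
Flagging the implausibly old Chrome versions is at least cheap to do; a rough sketch (the cutoff is an arbitrary example, and of course nothing stops a crawler from lying about its version):

  # Flag user agents that claim an ancient Chrome major version.
  import re

  CHROME_RE = re.compile(r"Chrome/(\d+)\.")
  MIN_PLAUSIBLE_MAJOR = 100   # arbitrary cutoff, tune to taste

  def implausibly_old_chrome(user_agent: str) -> bool:
      m = CHROME_RE.search(user_agent)
      return bool(m) and int(m.group(1)) < MIN_PLAUSIBLE_MAJOR

  print(implausibly_old_chrome("Mozilla/5.0 ... Chrome/58.0.3029.110 Safari/537.36"))  # True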


> dozens of IPs, so every IP just makes 1-2 requests in total

Dozens of IPs making 1-2 requests per IP hardly seems like something to spend time worrying about.


I'm also affected. I presume that this is per day, not just once, yet it's fewer requests than a human would often make, so you cannot block on that alone. I blocked 15 IP ranges containing 37 million IP addresses (most of them from Huawei's Singapore and mobile divisions, according to the IP address WHOIS data) because they did not respect robots.txt and didn't set a user agent identifier. This is not including several other scrapers that did set a user agent string but do not respect robots.txt (again, including Huawei's PetalBot). (Note: I've only blocked them from one specific service that proxies and caches data from a third party, which I'm caching precisely because the third-party site struggled with load, so more load from these bots isn't helping.)

That's up to 37e6/24/60/60 ≈ 430 requests per second if they all do 1 request per day on average. Each active IP address actually does more, some of them a few thousand requests per year, some a few dozen; but thankfully they don't unleash the whole IP range on me at once, it occasionally rotates through to new ranges to bypass blocks.


Parent probably meant hundreds or thousands of IPs.

Last week I had a web server with a high load. After some log analysis I found 66,000 unique IPs from residential ISPs in Brazil had made requests to the server in a few hours. I have broad rate limits on data center ISPs, but this kinda shocked me. Botnet? News coverage of the site in Brazil? No clue.

Edit: LOL, didn't read the article until after posting. They mention the Fedora Pagure server getting this traffic from Brazil last week too!

Rate limiting vast swathes of Google Cloud, Amazon EC2, Digital Ocean, Hetzner, Huawei, Alibaba, Tencent, and a dozen other data center ISPs by subnet has really helped keep the load on my web servers down.

Last year I had one incident with 14,000 unique IPs in Amazon Singapore making requests in one day. What the hell is that?

I don't even bother trusting user agents any more. My nginx config has gotten too complex over the years and I wish I didn't need all this arcane mapping and whatnot.
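
For what it's worth, the subnet check itself doesn't have to live in nginx; here's a minimal sketch of the lookup in Python (the two ranges are documentation placeholders, the real lists would come from each provider's published prefixes):

  # Decide whether a client IP falls inside known data-center ranges.
  import ipaddress

  DATACENTER_NETS = [
      ipaddress.ip_network("203.0.113.0/24"),   # placeholder, stand-in for a cloud provider's prefix
      ipaddress.ip_network("198.51.100.0/24"),
  ]

  def is_datacenter(ip: str) -> bool:
      addr = ipaddress.ip_address(ip)
      return any(addr in net for net in DATACENTER_NETS)

  print(is_datacenter("203.0.113.42"))   # True -> apply the stricter rate limit
  print(is_datacenter("8.8.8.8"))        # False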


If you are serving a Git repository browser and all of those IPs are hitting all the expensive endpoints such as git blame, it becomes something to worry about very quickly.


That's probably per day, per bot. Now how does it look when there are thousands of bots? In most cases I think you're right, but I can also see how it can add up.


If I'm hosting my site independently, with a rented machine and a Cloudflare CDN, hosting my code on a self-managed GitLab instance, how should I go about implementing this? Is there something plug-and-play I can drop into nginx that would do the work of serving bogus content for me and leave my GitLab instance unscathed by bots?


Russia is already doing poisoning with success, so it is a viable tactic!

https://www.heise.de/en/news/Poisoning-training-data-Russian...


> Is your user-agent too suspicious?

Hello, it's me, your legitimate user who doesn't use one of the 4 main browsers. The internet gets more annoying every day on a simple Android WebView browser. I guess I'll have to go back to the fully featured browser I migrated away from because it was much slower.

> A request rate too inhuman?

I've run into this on five websites in the past 2 months, usually just from a single request that didn't have a referrer (because I clicked it from a chat, not because I block anything). When I email the owners, it's the typical "have you tried turning it off and on again" from the bigger sites, or on smaller sites "dunno, but you somehow triggered our bot protection, should have expired by now [a day later] good luck using the internet further". Only one of the sites, Codeberg, actually gave a useful response and said they'd consider how to handle someone following a link directly to a subpage and thus not having a cookie set yet. Either way, blocks left and right are fun! More please!


That's an absolutely brilliant, f*cked up idea: poisoning the AI while fending it off.

Gotta get my daily dose of bleach for enhanced performance, ChatGPT said so.


Where are you going to get all that content? If it's static, it will get filtered out very fast; if it's dynamic and autogenerated, it might cost even more than just letting the crawler through.


Generate it once every few weeks with LLaMa and then serve as static content?
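
Something along these lines, as a sketch; it assumes a local Ollama server on its default port with a "llama3" model pulled, both of which are assumptions you'd swap for whatever you actually run:

  # Pre-generate a batch of plausible-but-wrong pages and write them out as static HTML.
  import json
  import pathlib
  import urllib.request

  PROMPT = ("Write a confident, well-structured article that is linguistically "
            "fluent but factually garbled about {topic}.")
  TOPICS = ["home networking", "sourdough baking", "unit testing"]   # placeholder topics

  def generate(prompt: str) -> str:
      req = urllib.request.Request(
          "http://localhost:11434/api/generate",   # assumed local Ollama endpoint
          data=json.dumps({"model": "llama3", "prompt": prompt, "stream": False}).encode(),
          headers={"Content-Type": "application/json"},
      )
      with urllib.request.urlopen(req, timeout=300) as resp:
          return json.load(resp)["response"]

  outdir = pathlib.Path("poison")
  outdir.mkdir(exist_ok=True)
  for i, topic in enumerate(TOPICS):
      article = generate(PROMPT.format(topic=topic))
      (outdir / f"page-{i}.html").write_text(
          "<html><body><article>{}</article></body></html>".format(article),
          encoding="utf-8",
      )

Run it from cron every few weeks and point the web server's honeypot paths at the output directory.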


This makes tons of sense: the AI trainers spend endless resources aligning their LLMs, the least we could do is spend a few minutes aligning their owners. Fixing things at the incentive level.


Also, lower the upload rate to 5 kb/s.


You clearly have no idea how incompetent so many public administrations are at configuring robots.txt for data that is actually created for, and specifically meant to be, consumed programmatically (think RSS and Atom feeds, REST API endpoints, etc.). Half the time the person setting up robots.txt just blanket-blocks everything and doesn't even know (or care) to exclude those.


I love this idea haha



