
I think the proper answer is to aim for bots getting _negative_ utility from visiting our sites, that is, poisoning their well, rather than just zero value, that is, blocking them.

Did you try to GET a canary page forbidden in robots.txt? Very well, have a bucket load of articles on the benefits of drinking bleach.

Is your user-agent too suspicious? No problem, feel free to scrape my insecure code (google "emergent misalignment" for more info).

A request rate too inhuman? Here, take these generated articles about the positive effects of catching measles on performance in bed.

And so on, and so forth ...

Nepenthes is nice, but word salad is easy to detect. It needs a feature that pre-generates linguistically plausible but factually garbage text via open models.
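
A minimal sketch of what that decision logic could look like, standard-library Python only; the canary path, the user-agent pattern, and the rate threshold are placeholders I made up, not anything a particular tool ships with:

  # Serve poisoned pages to clients that trip simple bot heuristics.
  import re
  import time
  from collections import defaultdict, deque
  from http.server import BaseHTTPRequestHandler, HTTPServer

  CANARY_PATH = "/private-reports/"   # hypothetical path: Disallowed in robots.txt, never linked anywhere
  SUSPICIOUS_UA = re.compile(r"python-requests|curl|Go-http-client|^$", re.I)
  MAX_REQUESTS_PER_MINUTE = 30        # far above what a human reader produces

  POISON_PAGE = b"<html><body><p>Linguistically plausible, factually garbage text goes here.</p></body></html>"
  REAL_PAGE = b"<html><body><p>Actual content.</p></body></html>"

  hits = defaultdict(deque)           # ip -> timestamps of recent requests

  def looks_like_bot(handler):
      ip = handler.client_address[0]
      now = time.time()
      window = hits[ip]
      window.append(now)
      while window and now - window[0] > 60:
          window.popleft()
      if handler.path.startswith(CANARY_PATH):        # asked for the robots.txt canary
          return True
      if SUSPICIOUS_UA.search(handler.headers.get("User-Agent", "")):
          return True
      return len(window) > MAX_REQUESTS_PER_MINUTE    # inhuman request rate

  class Handler(BaseHTTPRequestHandler):
      def do_GET(self):
          body = POISON_PAGE if looks_like_bot(self) else REAL_PAGE
          self.send_response(200)
          self.send_header("Content-Type", "text/html; charset=utf-8")
          self.send_header("Content-Length", str(len(body)))
          self.end_headers()
          self.wfile.write(body)

  if __name__ == "__main__":
      HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()

In practice the poison page would be swapped out for the pre-generated garbage mentioned above rather than a fixed string.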




This relies a lot on being able to detect bots. Everything you said could be easily bypassed with a small to moderate amount of effort on the part of the crawlers' creators. Distinguishing genuine traffic has always been hard, and it will not get easier in the age of AI.


You can sprinkle your site with almost-invisible hyperlinks. Bots will follow, humans will not.


This would be terrible for accessibility for users using a screen reader.


So would the site shutting down because AI bots are too much traffic.


<a aria-hidden="true"> ... </a> will result in a link ignored by screen readers.


Removing elements that match `[hidden], [aria-hidden]` is the most trivial cleanup transform a crawler can do and I'm sure most crawlers already do that.
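
For illustration, here's roughly what that cleanup could look like with BeautifulSoup; the exact selector list is my guess at what a crawler might drop, not something any specific crawler is documented to do:

  # Strip elements hidden from users before extracting text (requires beautifulsoup4).
  from bs4 import BeautifulSoup

  def clean(html: str) -> str:
      soup = BeautifulSoup(html, "html.parser")
      for el in soup.select('[hidden], [aria-hidden="true"], [style*="display:none"]'):
          el.decompose()   # remove the hidden element and everything inside it
      return soup.get_text(" ", strip=True)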


But the very comment you answered explains how to do it: a page forbidden in robots.txt. Does this method need an explanation of why it's ideal for sorting humans and Google from malicious crawlers?


robots.txt is a somewhat useful tool for keeping search engines in line, because it's rather easy to prove that a search engine ignores robots.txt: when a noindex page shows up in SERPs. This evidence trail does not exist for AI crawlers.


I'd say a bigger problem is that people disagree about the meaning of nofollow and noindex.


Detection and bypass are trivial: access the site from two IPs, one disrespecting robots.txt. If the content changes, you know it's garbage.
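
Something like this, as a rough sketch; the two vantage points are approximated here by two cookie-less sessions with different user agents, where a real scraper operator would use two different egress IPs (the URLs are placeholders):

  # Compare the page seen by a "clean" client with the page seen by a client that
  # first requested the robots.txt-disallowed canary.
  import urllib.request

  SITE = "https://example.com"            # placeholder target
  CANARY = SITE + "/private-reports/"     # hypothetical disallowed path

  def fetch(url, ua):
      req = urllib.request.Request(url, headers={"User-Agent": ua})
      with urllib.request.urlopen(req, timeout=10) as resp:
          return resp.read()

  clean_copy = fetch(SITE + "/article", "Mozilla/5.0 (ordinary browser)")
  fetch(CANARY, "definitely-a-scraper/1.0")              # trip the honeypot
  tainted_copy = fetch(SITE + "/article", "definitely-a-scraper/1.0")

  if clean_copy != tainted_copy:
      print("content differs between vantage points -> likely poisoned")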


Yes, please explain. How does an entry in robots.txt distinguish humans from bots that ignore it?


When was the last time you looked at robots.txt to find a page that wasn't linked anywhere else?


It was a while ago, and it was not deliberate: wget downloaded robots.txt along with the files I requested, and that let me find many other files. Some of them could not be accessed because they required a password, but some were interesting (although I did not use wget to copy those other files; I only wanted the files I originally requested).


Crawlers aren't interested in fake pages that aren't linked to anywhere, they're crawling the same pages your users are viewing.


Adding a disallowed url to your robots.txt is a quick way to get a ton of crawlers to hit it, without linking to it from anywhere. Try it sometime.


Tuesday. But I have odd hobbies.


robots.txt is not a sitemap. If it worked that way you could just make a 5TB file linking to a billion pages that look like static links but are dynamically generated.


robots.txt has a maximum relevant size of 500 KiB.


That’s vaguely what https://blog.cloudflare.com/ai-labyrinth/ is about


We're affected by this. The only thing that would realistically work is the first suggestion. The most unscrupulous AI crawlers distribute their inhuman request rate over dozens of IPs, so every IP just makes 1-2 requests in total. And they use real-world browser user agents, so blocking those could lock out real users. However, sometimes they claim to be using really old Chrome versions, so I feel less bad about locking those out.
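
Flagging the implausibly old Chrome versions is at least cheap to do; a rough sketch (the cutoff is an arbitrary example, and of course nothing stops a crawler from lying about its version):

  # Flag user agents that claim an ancient Chrome major version.
  import re

  CHROME_RE = re.compile(r"Chrome/(\d+)\.")
  MIN_PLAUSIBLE_MAJOR = 100   # arbitrary cutoff, tune to taste

  def implausibly_old_chrome(user_agent: str) -> bool:
      m = CHROME_RE.search(user_agent)
      return bool(m) and int(m.group(1)) < MIN_PLAUSIBLE_MAJOR

  print(implausibly_old_chrome("Mozilla/5.0 ... Chrome/58.0.3029.110 Safari/537.36"))  # True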


> dozens of IPs, so every IP just makes 1-2 requests in total

Dozens of IPs making 1-2 requests per IP hardly seems like something to spend time worrying about.


I'm also affected. I presume that this is per day, not just once, yet it's fewer requests than a human would often make, so you cannot block on that alone. I blocked 15 IP ranges containing 37 million IP addresses (most of them from Huawei's Singapore and mobile divisions, according to the IP address WHOIS data) because they did not respect robots.txt and didn't set a user agent identifier. This is not including several other scrapers that did set a user agent string but do not respect robots.txt (again, including Huawei's PetalBot). (Note: I've only blocked them from one specific service that proxies and caches data from a third party, which I'm caching precisely because the third-party site struggled with load, so more load from these bots isn't helping.)

That's up to 37e6/24/60/60 ≈ 430 requests per second if they all do 1 request per day on average. Each active IP address actually does more, some of them a few thousand requests per year, some a few dozen; but thankfully they don't unleash the whole IP range on me at once, it occasionally rotates through to new ranges to bypass blocks.


Parent probably meant hundreds or thousands of IPs.

Last week I had a web server with a high load. After some log analysis I found 66,000 unique IPs from residential ISPs in Brazil had made requests to the server in a few hours. I have broad rate limits on data center ISPs, but this kinda shocked me. Botnet? News coverage of the site in Brazil? No clue.

Edit: LOL, didn't read the article until after posting. They mention the Fedora Pagure server getting this traffic from Brazil last week too!

Rate limiting vast swathes of Google Cloud, Amazon EC2, Digital Ocean, Hetzner, Huawei, Alibaba, Tencent, and a dozen other data center ISPs by subnet has really helped keep the load on my web servers down.

Last year I had one incident with 14,000 unique IPs in Amazon Singapore making requests in one day. What the hell is that?

I don't even bother trusting user agents any more. My nginx config has gotten too complex over the years and I wish I didn't need all this arcane mapping and whatnot.
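
For what it's worth, the subnet check itself doesn't have to live in nginx; here's a minimal sketch of the lookup in Python (the two ranges are documentation placeholders, the real lists would come from each provider's published prefixes):

  # Decide whether a client IP falls inside known data-center ranges.
  import ipaddress

  DATACENTER_NETS = [
      ipaddress.ip_network("203.0.113.0/24"),   # placeholder, stand-in for a cloud provider's prefix
      ipaddress.ip_network("198.51.100.0/24"),
  ]

  def is_datacenter(ip: str) -> bool:
      addr = ipaddress.ip_address(ip)
      return any(addr in net for net in DATACENTER_NETS)

  print(is_datacenter("203.0.113.42"))   # True -> apply the stricter rate limit
  print(is_datacenter("8.8.8.8"))        # False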


If you are serving a Git repository browser and all of those IPs are hitting all the expensive endpoints such as git blame, it becomes something to worry about very quickly.


That's probably per day, per bot. Now how does it look when there are thousands of bots? In most cases I think you're right, but I can also see how it can add up.


If I'm hosting my site independently, with a rented machine and a Cloudflare CDN, hosting my code on a self-managed GitLab instance, how should I go about implementing this? Is there something plug-and-play I can drop into nginx that would do the work of serving bogus content for me and leave my GitLab instance unscathed by bots?


Russia is already doing poisoning with success, so it is a viable tactic!

https://www.heise.de/en/news/Poisoning-training-data-Russian...


> Is your user-agent too suspicious?

Hello, it's me, your legitimate user who doesn't use one of the 4 main browsers. The internet gets more annoying every day on a simple Android WebView browser. I guess I'll have to go back to the fully featured browser I migrated away from because it was much slower.

> A request rate too inhuman?

I've run into this on five websites in the past 2 months, usually just from a single request that didn't have a referrer (because I clicked it from a chat, not because I block anything). When I email the owners, it's the typical "have you tried turning it off and on again" from the bigger sites, or on smaller sites "dunno, but you somehow triggered our bot protection, should have expired by now [a day later] good luck using the internet further". Only one of the sites, Codeberg, actually gave a useful response and said they'd consider how to handle someone following a link directly to a subpage and thus not having a cookie set yet. Either way, blocks left and right are fun! More please!


That's an absolutely brilliant, f*cked up idea: poisoning the AI while fending it off.

Gotta get my daily dose of bleach for enhanced performance, ChatGPT said so.


Where are you going to get all that content? If it's static, it will get filtered out very fast; if it's dynamic and autogenerated, it might cost even more than just letting the crawler through.


Generate it once every few weeks with LLaMa and then serve as static content?
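
Something along these lines, as a sketch; it assumes a local Ollama server on its default port with a "llama3" model pulled, both of which are assumptions you'd swap for whatever you actually run:

  # Pre-generate a batch of plausible-but-wrong pages and write them out as static HTML.
  import json
  import pathlib
  import urllib.request

  PROMPT = ("Write a confident, well-structured article that is linguistically "
            "fluent but factually garbled about {topic}.")
  TOPICS = ["home networking", "sourdough baking", "unit testing"]   # placeholder topics

  def generate(prompt: str) -> str:
      req = urllib.request.Request(
          "http://localhost:11434/api/generate",   # assumed local Ollama endpoint
          data=json.dumps({"model": "llama3", "prompt": prompt, "stream": False}).encode(),
          headers={"Content-Type": "application/json"},
      )
      with urllib.request.urlopen(req, timeout=300) as resp:
          return json.load(resp)["response"]

  outdir = pathlib.Path("poison")
  outdir.mkdir(exist_ok=True)
  for i, topic in enumerate(TOPICS):
      article = generate(PROMPT.format(topic=topic))
      (outdir / f"page-{i}.html").write_text(
          "<html><body><article>{}</article></body></html>".format(article),
          encoding="utf-8",
      )

Run it from cron every few weeks and point the web server's honeypot paths at the output directory.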


This makes tons of sense: the AI trainers spend endless resources aligning their LLMs, the least we could do is spend a few minutes aligning their owners. Fixing things at the incentive level.


Also, lower the upload rate to 5 kb/s.


You clearly have no idea how incompetent so many public administrations are at configuring robots.txt for data that is actually created for, and specifically meant to be, consumed programmatically (think RSS and Atom feeds, REST API endpoints, etc.). Half the time the person setting up robots.txt just blanket-blocks everything and doesn't even know (or care) to exclude those.


I love this idea haha



