
I like that Veritasium vid a lot, I've watched it a couple of times. The thing is, there's no way to retaliate against a crawler that ignores robots.txt. IP bans don't work, user-agent bans don't work, and there's no human to shame on social media either. If there's no way to retaliate or provide some kind of meaningful negative feedback, the whole thing breaks down. In the terms of the Veritasium video: when a crawler defects it reaps the reward, but the content provider has no way to defect back, so the crawler defects 100% of the time and collects 100% of the defection points. I can't remember when I first read the robots.txt spec, but I do remember finding it strange that it was a "pretty please" request aimed at crawlers with a financial incentive to crawl as much as they can. Why even go through the effort of typing it out?

EDIT: I thought about it for a minute. In the olden days, a crawler that followed every path through a website could end up with an inferior search index. So robots.txt gave search engines a hint about which content was valuable to index. The content provider gained because their SEO improved (and their CPU utilization dropped), and the search engine gained because its index was better. So there was an advantage to cooperation back then, but with crawlers feeding LLMs that's no longer the case.
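To make the "hint" framing concrete, here's a minimal robots.txt sketch of the cooperative case described above. The paths and the GPTBot stanza are illustrative, and every directive here is purely advisory; nothing enforces it:

```
# Steer well-behaved crawlers away from low-value or expensive paths
User-agent: *
Disallow: /search      # endless query permutations, bad for the index and the CPU
Disallow: /tmp/

# Opt a specific crawler out entirely; relies wholly on it honoring the request
User-agent: GPTBot
Disallow: /
```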



No, robots.txt can't fix this.

Have you tried Anubis? It was all over the internet a few months ago. I wonder if it actually works well. https://github.com/TecharoHQ/anubis


This is a really cool tool. I haven't seen it before. Thank you for sharing it!

On their README.md they state:

> This program is designed to help protect the small internet from the endless storm of requests that flood in from AI companies. Anubis is as lightweight as possible to ensure that everyone can afford to protect the communities closest to them.

I love the idea!
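For context on how a tool like this can impose a cost without identifying the client: Anubis is built around a proof-of-work challenge, where the browser must find a nonce whose hash meets a difficulty target before the server serves the page. The sketch below is a generic illustration of that idea in Python, not Anubis's actual implementation; the function names and the hex-zero difficulty scheme are assumptions for the example:

```python
import hashlib
import itertools

def solve_challenge(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce so that SHA-256(challenge + nonce)
    starts with `difficulty` hex zeros. Cost grows ~16x per difficulty step."""
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: verification is a single hash, so it stays cheap
    even though solving was expensive."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

The asymmetry is the point: one hash to verify, many to solve. A human loads one page and never notices; a crawler hammering thousands of URLs pays the solve cost on every request.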



