Copying my comment here for additional discussion:
Worth noting that most of these bots are 'good bots' (i.e. they will obey robots.txt). So you can avoid the nginx resource usage entirely by adding suitable robots.txt entries.
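For example, a minimal robots.txt sketch (the crawler tokens below are common examples - check each bot's documentation for its exact name):

User-agent: GPTBot
User-agent: CCBot
Disallow: /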
I think using nginx tests like this could have negative effects on showing OpenGraph metadata (including images).
If choosing this approach, however, I would probably respond with a 403 to mark the resource as forbidden, since bots are more likely to keep making attempts if they think the server will come back online.
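A minimal sketch of that in nginx, assuming you match on user agent (the bot names and the $block_bot variable are illustrative, not a definitive list), and deliberately leaving the OpenGraph fetchers alone:

# at http level: map the user agent to a flag; add or remove patterns to taste
map $http_user_agent $block_bot {
    default                        0;
    ~*(GPTBot|CCBot|Bytespider)    1;
    # facebookexternalhit and Twitterbot are not listed, so link previews keep working
}
server {
    # ... usual listen/server_name/TLS config ...
    if ($block_bot) { return 403; }
}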
I advise 402 Payment Required with a Location: mailto:payment-offers@yourdomain header. It's still a 4xx, so they'll probably know not to hit it repeatedly, and if they reach out to negotiate payment, you have a choice whether to accept or not.
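In nginx that could look something like this, reusing the $block_bot map from the sketch above (add_header needs "always" so the header is emitted on a 4xx response):

location / {
    if ($block_bot) {
        add_header Location "mailto:payment-offers@yourdomain" always;
        return 402;
    }
    # ... normal handling ...
}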
If you are playing around with blocking bots, here is a trick that will block some legitimate and friendly bots along with some bad ones. It will do nothing against headless Chrome acting as a bot. Only use this on silly hobby sites; do not use it in production, even though most load balancers can do this and much more.
In the main site config, redirect anyone not using HTTP/2.0. Googlebot still doesn't use HTTP/2.0, so this will block Google; Bing is OK though. One could instead use variables to make this multi-condition and add exceptions for their CIDR blocks (sketched after the config below). Point an "auth." DNS record at the same IP and make sure you have a certificate for it, or a wildcard cert.
# in main TLS site config:
# replace apex with your domain and tld with its tld.
if ($server_protocol != "HTTP/2.0") {
    return 302 https://auth.apex.tld$request_uri;
}
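As for the multi-condition variant with CIDR exceptions mentioned above, a rough sketch (the range shown is just an example of a Googlebot block - verify against the list Google publishes):

geo $trusted_crawler {
    default            0;
    66.249.64.0/19     1;    # example Googlebot range
}
map "$server_protocol:$trusted_crawler" $send_to_auth {
    default            0;
    ~^HTTP/1.*:0       1;    # old protocol AND not on the trusted list
}
# then in the main TLS server block:
if ($send_to_auth) { return 302 https://auth.apex.tld$request_uri; }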
Then in your "auth" domain, use the same config as the main site minus the redirect, but add basic authentication. Anyone not using HTTP/2.0 can still access the site if they know the right username/password. If you get a lot of bots, have an init script copy the password file into /dev/shm and reference it from there in nginx to avoid the disk reads (a sketch follows the config below).
# then in the auth.apex.tld config.
# optionally give a hint replacing i_heart_bots with name_blah_pass_blah
auth_delay 2s;
location / {
    auth_basic "i_heart_bots";
    auth_basic_user_file /etc/nginx/.pw;
}
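For the /dev/shm trick mentioned above, a rough sketch of the init-script step (paths, ownership and the www-data group are assumptions - adjust for your setup):

# run once at boot, e.g. from an init/startup script:
install -o root -g www-data -m 0640 /etc/nginx/.pw /dev/shm/.pw
# and point nginx at the tmpfs copy instead:
# auth_basic_user_file /dev/shm/.pw;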
This will block some command-line API tools, most bots good or bad, and some scanning tools. Some bots give up before the 2-second auth_delay, so you will see a status 499 instead of 401 in the access logs. Only do this on silly hobby sites. Do not use it in production. Only people wearing a T-shirt like this one [1] may do this in production.
One may be surprised to find that most bots use old libraries that are not HTTP/2.0 enabled. When they catch up, we can replace this logic using HTTP/3.0 and UDP. Beyond that, we can force people to win a game of tic-tac-toe or Doom in JavaScript.
The most permanent and effective solutions (in terms of minimizing adversarial activity over time and destroying the value of what is harvested) involve serving fake content (poison!), making site failures sporadic (forcing them to maintain state), and making some of those errors look like they're upstream not something you're doing on a specific machine (really bad luck mate!).
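One way to sketch the "sporadic failures that look upstream" part in nginx, assuming you already have some $block_bot variable (however you derive it) marking the traffic you want to poison - the 20% and the 502 are arbitrary choices:

# at http level: deterministically fail a slice of flagged requests
split_clients "${remote_addr}${request_uri}" $luck {
    20%        "bad";
    *          "good";
}
map "$block_bot:$luck" $fake_outage {
    "1:bad"    1;
    default    0;
}
# in the server block:
if ($fake_outage) { return 502; }    # reads like a broken upstream, not a deliberate block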
The deadenders who felt it was worth it will keep trying for at least a while; the new exploiters will tend to give up sooner. robots.txt is a courtesy. Not everybody puts stuff on the internet with a working theory that your experience is more important than theirs.
Why would you want to block Meta and Twitter? I think the rich objects on social networks are pretty important, and they are only shown if you let the social networks visit your site.
Kinda. Meta and Twitter want you to join their platforms; they aren't general-purpose search engines scraping the entire Internet - they crawl what the people who join them post. A request from Meta/Twitter is probably triggered by a link someone put in a post.
ChatGPT can't be an impolite Internet citizen (spoofing UAs) and claim to be using AI for the good of humanity, so they're not going to be dishonest with their user-agent.
> ChatGPT can't be an impolite Internet citizen (spoofing UAs) and claim to be using AI for the good of humanity, so they're not going to be dishonest with their user-agent.
That reads an awful lot like "Google can't be evil and claim that their motto is 'Don't be evil', so they're not going to be evil", but here we are. The profit motive eventually undoes any principled claim by a company.
Absolutely, but until then, adding bot UAs to a blocklist is somewhat useful.
Like anything else in IT security, it's never "set and forget" permanently; the effectiveness of things like that decays over time and must be periodically re-evaluated.
But if something can be used to your advantage now, even if only for a while, then why not use it?
robots.txt is only ignored by the bad bots and scrapers, so I agree with you: this nginx configuration is pretty useless for the well-behaved bots, and it doesn't even solve the problem on the other side!