Using Nginx to block Meta, Twitter and ChatGPT access to your sites (gist.github.com)
104 points by danradunchev on June 26, 2023 | 25 comments


Copying my comment here for additional discussion:

Worth noting that most of these bots are 'good bots' (i.e. they will obey robots.txt). So you can avoid the nginx resource usage entirely by adding suitable robots.txt entries.
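
For example, a minimal robots.txt sketch (the user-agent tokens below are what these crawlers are commonly documented as using, but verify them against each vendor's current docs before relying on this):

    # assumed crawler tokens; confirm against each vendor's documentation
    User-agent: GPTBot
    User-agent: ChatGPT-User
    User-agent: facebookexternalhit
    User-agent: Twitterbot
    Disallow: /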

I think using nginx tests like this could have negative effects on showing OpenGraph metadata (including images).

If choosing this approach, however, I would respond with a 403 to mark the request as forbidden, since bots are more likely to keep making attempts if they think the server will come back online.
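
A minimal nginx sketch of that idea (my own illustration, not the linked gist; the user-agent regex is an assumption, so adjust it to the bots you actually want to block):

    # in the http context: flag known bot user agents
    map $http_user_agent $blocked_ua {
        default                                                   0;
        "~*(GPTBot|ChatGPT-User|facebookexternalhit|Twitterbot)"  1;
    }
    # in the site's server block:
    if ($blocked_ua) { return 403; }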


I advise 402 Payment Required with Location: mailto:payment-offers@yourdomain. It's still a 4xx, so they'll probably know not to hit it repeatedly, and if they reach out to negotiate payment, you have a choice whether to accept or not.
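
A hedged nginx sketch of this reply (the UA regex and mailto address are placeholders; add_header needs the always flag so the header is attached to a 4xx response, and add_header inside an if only works at location level):

    location / {
        if ($http_user_agent ~* "(GPTBot|ChatGPT-User|facebookexternalhit|Twitterbot)") {
            add_header Location "mailto:payment-offers@yourdomain" always;
            return 402;
        }
        # ... normal handling of the location
    }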


If playing around with blocking bots, here is one approach that will block some legitimate and friendly bots as well as some bad ones. It will do nothing against headless Chrome acting as a bot. Only use this on silly hobby sites; do not use it in production, even though most load balancers can do this and much more.

In the main site config, redirect anyone not using HTTP/2.0. Googlebot still doesn't use HTTP/2.0, so this will block Google; Bing is OK though. One could instead use variables to make this multi-condition and make exceptions for their CIDR blocks (a sketch follows the config below). Point an "auth." DNS record to the same IP and ensure you have a cert for it, or a wildcard cert.

    # in main TLS site config:
    # replace apex with your domain and tld with its tld.
    if ($server_protocol != HTTP/2.0) { return 302 https://auth.apex.tld$request_uri; }
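As an aside, here is a hedged sketch of the multi-condition/CIDR-exception variation mentioned above (not part of the original recipe; the range below is only a placeholder for a crawler's published range, so verify it before use):

    # in the http context:
    geo $trusted_crawler {
        default         0;
        66.249.64.0/19  1;   # placeholder Googlebot range; check Google's published list
    }
    map "$server_protocol:$trusted_crawler" $send_to_auth {
        default          0;
        "~^HTTP/2.0:"    0;   # already on HTTP/2: leave alone
        "~:0$"           1;   # not HTTP/2 and not a trusted crawler
    }
    # then in the main TLS site config, instead of the plain protocol check:
    if ($send_to_auth) { return 302 https://auth.apex.tld$request_uri; }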
Then in your "auth" domain use the same config as the main site, minus the redirect, but add basic authentication. Anyone not using HTTP/2.0 can still access the site if they know the right username/password. If you get a lot of bots, have an init script copy the password file into /dev/shm and reference it from there in nginx to avoid disk reads.

    # then in the auth.apex.tld config.
    # optionally give a hint replacing i_heart_bots with name_blah_pass_blah
    auth_delay 2s;
    location / {
        auth_basic "i_heart_bots";
        auth_basic_user_file /etc/nginx/.pw;
    }
This will block some API command-line tools, most bots good or bad, and some scanning tools. Some bots will give up before the 2-second delay, so you will get a status 499 instead of 401 in the access logs. Only do this on silly hobby sites. Do not use it in production. Only people wearing a T-shirt like this one [1] may do this in production.

One may be surprised to find that most bots use old libraries that are not HTTP/2.0-enabled. When they catch up, we can replace this logic using HTTP/3.0 and UDP. Beyond that, we can force people to win a game of tic-tac-toe or Doom over JavaScript.

[1] - https://www.amazon.com/Dont-Always-Test-Production-Shirt/dp/...


That’s fine until SEO scrapers eat your site and regurgitate the content somewhere else.

Once it’s online, it’s online.


The most permanent and effective solutions (in terms of minimizing adversarial activity over time and destroying the value of what is harvested) involve serving fake content (poison!), making site failures sporadic (forcing them to maintain state), and making some of those errors look like they're upstream rather than something you're doing on a specific machine (really bad luck, mate!).
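
A hedged nginx sketch of the sporadic-failure part only (my own illustration, not the commenter's setup; the user-agent list and the 10% figure are placeholders): a fraction of suspected-bot requests gets a 502, which reads as an upstream problem rather than deliberate blocking.

    # in the http context:
    map $http_user_agent $suspected_bot {
        default                                                   "";
        "~*(GPTBot|ChatGPT-User|facebookexternalhit|Twitterbot)"  1;
    }
    split_clients "${remote_addr}${request_id}" $unlucky {
        10%     1;    # roughly 10% of requests draw the fake failure
        *       "";
    }
    map "$suspected_bot$unlucky" $fake_upstream_error {
        default  "";
        "11"     1;   # suspected bot AND unlucky draw
    }
    # in the server block:
    if ($fake_upstream_error) { return 502; }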

The deadenders who felt it was worth it will keep trying for at least a while; the new exploiters will tend to give up sooner. robots.txt is a courtesy. Not everybody puts stuff on the internet with a working theory that your experience is more important than theirs.


Why would you want to block Meta and Twitter? I think the rich objects on social networks are pretty important, and they are only shown if you let the social networks visit your site.


For everyone who misses the point:

- I don't want my content on those sites in any form, and I don't want my content to feed their algorithms. So I do not care about Open Graph or previews.

- Using robots.txt assumes they will 'obey' it. But they may choose not to. It's not mandatory in any way.

- Yes, they can fake the UA. That does not mean I should not take any measures to block them just because they can.


And allow your data to be used for training by Google and Bing(bot), etc. ;) You cannot do SEO while excluding AI nowadays!


Hmm, I use NPM (Nginx Proxy Manager) to manage my nginx; I think I'll look into ways to implement this there.


why not use robots.txt


That assumes they'll honour it, either now or in the future.


If they were that malicious, they could also change their bot's UA, so would it really matter?


There are a lot of dumb bots.


And you think Meta or Twitter's bots would be dumb?

If they wanted to scrape your site, nobody can prevent them.


Kinda. Meta and Twitter want you to join their platforms; they aren't general-purpose search engines scraping the entire Internet - they're scraping content posted by the people who join them. Requests from Meta/Twitter are probably from a link someone put in a post.

ChatGPT can't be an impolite Internet citizen (spoofing UAs) and claim to be using AI for the good of humanity, so they're not going to be dishonest with their user agent.


> ChatGPT can't be an impolite Internet citizen (spoofing UAs) and claim to be using AI for the good of humanity, so they're not going to be dishonest with their user agent.

That reads an awful lot like "Google can't be evil and claim that their motto is 'Don't be evil', so they're not going to be evil", but here we are. The profit motive eventually undoes any principled claim by a company.


Absolutely, but until then, adding bot UAs to a blocklist is somewhat useful.

Like anything else in IT security it's never "set and forget" permanently; the effectiveness of things like that decay over time and must be periodically re-evaluated.

But if something can be used to your advantage now, even if only for a while, then why not use it?


why not both?

and check your logs to see who is not complying
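
One hedged way to do that check in nginx itself (the user-agent list is a placeholder; with a Disallow rule in robots.txt, any of these agents still showing up in this log is not complying):

    # in the http context: flag the crawlers you told to stay away
    map $http_user_agent $watched_bot {
        default                                                   0;
        "~*(GPTBot|ChatGPT-User|facebookexternalhit|Twitterbot)"  1;
    }
    # in the server block: write their requests to a separate log
    access_log /var/log/nginx/bots.log combined if=$watched_bot;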


That’s a good one. Could non-compliance be the subject of a lawsuit?


Why would they be? No contract was signed, no DRM was broken; they’re simply accessing the website you’re publicly hosting.


robots.txt is ignored only by bad bots and scrapers (which will fake their UA anyway), so I agree with you that this nginx configuration is pretty useless and doesn’t even solve the problem on the other side!


can you do this with htaccess (for shared hosting)?


Typically, yes, but you’ll need some other syntax if it’s Apache or whatever.
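
For Apache via .htaccess, a rough equivalent might look like this (a sketch only: mod_rewrite must be available on the host, and the user-agent tokens are the same assumptions as above):

    # return 403 to matched crawler user agents
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (GPTBot|ChatGPT-User|facebookexternalhit|Twitterbot) [NC]
    RewriteRule ^ - [F]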


How to turn your site (and/or fediverse instance) invisible to Meta (Facebook, Instagram, Threads), Twitter and OpenAI ChatGPT.


and also potentially iOS/iMessage previews[0], which may not be a desired outcome.

[0] https://webmasters.stackexchange.com/questions/137914/spike-...



