Automatic updates are absolutely not peak stupidity. Most users’ devices would have nasty security vulnerabilities wide open for a much longer period of time without automatic updates.
I was thinking the same, and read it as “Xi”kipedia. Then sure enough, one of the articles it immediately showed when it loaded was “General Secretary of the Chinese Communist Party”, i.e. Xi Jinping.
Hey. I run a small community forum and I've been dealing with this exact same kind of behaviour, where well over 99% of requests are bad crawlers. There used to be plenty of "tells" for the faked browsers, HTTP/1.1 being a huge one. As you said, however, they're getting a bit smarter about that and it's becoming increasingly difficult to differentiate them from legitimate traffic.
It's been getting worse over the past year, with the past few weeks in particular seeing a massive change literally overnight. I had to aggressively tune my WAF rules to even remotely get things under control. With Cloudflare I'm aggressively issuing browser challenges to any browser that looks remotely suspicious, and the pass rate is currently below 0.5%. For my users' sake, a successful browser challenge is "valid" for over a month, but this still feels like another thing that'll eventually be bypassed.
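For anyone in the same boat, the kind of rule I mean looks roughly like this as a Cloudflare custom rule expression, with the action set to Managed Challenge. This is a sketch, not my exact config: `cf.client.bot` exempts Cloudflare-verified bots, and HTTP/1.x from something claiming to be a modern browser is one of the "tells" mentioned above. What counts as "suspicious" beyond that depends entirely on your own traffic.

```
not cf.client.bot
and http.request.version in {"HTTP/1.0" "HTTP/1.1"}
```

The challenge-validity period is set under the zone's security settings ("Challenge Passage"), which is how successful challenges stay valid for a month.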
I'd be keen to know if you've found any other effective ways of mitigating these most recent aggressive scraping requests. Even a simple "yes" or "no" would be appreciated; I think it's fair to be apprehensive about sharing some specific details publicly since even a lot of folks here on HN seem to think it's their right to scrape content with orders of magnitude higher throughput than all users combined.
I really don't know how this is sustainable long-term. It's eaten up quite a lot of my personal time and effort just for the sake of a hobby that I otherwise greatly enjoy.
Try scraping any of the major players (e.g. Amazon) without a residential proxy; it won't work. I appreciate that you're offering to abide by crawling etiquette (e.g. robots.txt), but no major app supports that any more.
You're thinking about the case of big AI companies crawling your blog. I'm talking about a small startup trying to do traditional indexing and needing to run from residential proxy to make it work.
Thank you for speaking some sense. As a site operator that's been inundated with junk traffic over the past ~month where well in excess of 99% of it has to be blocked, the scrapers have brought this upon themselves.
I actually do let quite a few known, "good" scrapers scrape my stuff. They identify themselves, they make it clear what they do, and they respect conventions like robots.txt.
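The allow-listing side of that is simple in robots.txt (bot names here are just examples; per the spec, a group can name several user agents, and of course only well-behaved crawlers honor it, which is exactly the point):

```
User-agent: Googlebot
User-agent: bingbot
Allow: /

User-agent: *
Disallow: /
```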
These residential proxies have been abused by scrapers that use random legit-looking user agents and absolutely hammer websites. What is it with these scrapers just not understanding consent? It's gross.
I run a really small forum and I've been absolutely inundated with a bunch of junk traffic. I had to tighten my Cloudflare WAF rules a whole bunch, and start issuing browser challenges way more aggressively.
Excluding known "good" crawlers, well over 99% of the traffic hitting the site has been attempting to maliciously scrape. Most of it looks genuine at a glance: random plausible user agents, coming from random residential proxies in various countries, usually the US.
For the traffic that does make it all the way to a browser challenge, the success rate is a measly 0.48%. Put another way, over 50% of traffic is already blocked by that point, and of the under 50% that makes it to a browser challenge, more than 99.5% fails that challenge.
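To make those rates concrete, here's the same arithmetic with hypothetical round figures (the 10,000-request total is an assumption for illustration; the percentages are the ones quoted above):

```python
total = 10_000                       # requests hitting the site (hypothetical)
blocked_by_waf = 5_200               # >50% blocked outright by WAF rules
challenged = total - blocked_by_waf  # 4,800 reach a browser challenge
passed = round(challenged * 0.0048)  # 0.48% challenge success rate
print(challenged, passed)            # 4800 23
```

So out of 10,000 requests, only about 23 turn out to be legitimate new visitors.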
There's been virtually no disruption to users either, since I configured successful challenges to be remembered for a long period of time. The legitimate traffic is a gentle trickle, while the WAF is holding back garbage traffic orders of magnitude above normal levels. The scale of it is truly insane.
For many ephemeral workloads, sure, but that comes at the expense of generally worse and less consistent CPU performance.
There are plenty of workloads where I’d love to double the memory and halve the cores compared to what the memory-optimised R instances offer, or where I could further double the cores and halve the RAM from what the compute-optimised C instances can do.
“Serverless” options can provide that to an extent, but it’s no free lunch, especially in situations where performance is a large consideration. I’ve found some use cases where it was better to avoid AWS entirely and opt for dedicated options elsewhere. AWS is remarkably uncompetitive in some use cases.
I like that https://discordstatus.com/ shows the API response times as well. There are times when Discord seems to have issues, and those usually correlate very well with increased API response times.
Reddit Status used to show API response times way back in the day when I still used the site, but they've really watered it down since then. Everything that goes there has to be manually entered now, AFAIK. Not to mention that one of the few sections is for "ads.reddit.com", classic.
I set up Immich last week and I absolutely love it. Docker is my "happy place" and I found the setup pretty straightforward, though it does have some rough edges that I anticipate will be sorted out as the project continues to mature.
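For anyone curious, the setup is essentially the docker compose quickstart from the Immich docs. The URLs and filenames below are what the docs pointed at when I set it up; check the current quickstart before copying:

```shell
mkdir immich-app && cd immich-app
wget -O docker-compose.yml https://github.com/immich-app/immich/releases/latest/download/docker-compose.yml
wget -O .env https://github.com/immich-app/immich/releases/latest/download/example.env
# Edit .env: at minimum, set UPLOAD_LOCATION and DB_PASSWORD
docker compose up -d
```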
I showed Immich to my partner and they loved it so much that we've ordered significantly more storage for the server to accommodate it. We're currently using both Google Photos and OneDrive, but with this we'll be ditching OneDrive and filling that niche with Immich (as well as expanded network storage in general).
The website and documentation are super clear about not using Immich as the only source of photos. This is why we'll keep using Google Photos, and why I'll also be backing up Immich and portions of the network storage to B2 via restic. I've used this snapshotting pattern for my general server data for years, and it's even saved me a couple of times. Backups are something you hope to never need, but boy are they satisfying when you do need them and have them set up properly!
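The restic-to-B2 piece is straightforward. A sketch with placeholder credentials, bucket, and paths (restic's B2 backend reads the `B2_ACCOUNT_ID`/`B2_ACCOUNT_KEY` environment variables; the retention policy is just one I find reasonable):

```shell
export B2_ACCOUNT_ID="xxxxxxxxxxxx"        # placeholder B2 key ID
export B2_ACCOUNT_KEY="xxxxxxxxxxxxxxxx"   # placeholder B2 application key
export RESTIC_PASSWORD_FILE="$HOME/.restic-pass"

restic -r b2:my-bucket:server init        # one-time repository setup
restic -r b2:my-bucket:server backup /srv/immich/library /srv/data
restic -r b2:my-bucket:server forget \
    --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --prune
```

Run the `backup` and `forget` steps on a timer (cron, systemd) and you have cheap, encrypted, deduplicated off-site snapshots.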