
We feel this at work too. We run a book streaming platform with all books, booklists, authors, narrators and publishers available as standalone web pages for SEO - multiple millions of them. The last 6 months have turned into a hellscape, for a few reasons:

1. It's become commonplace to not respect rate limits

2. Bots no longer identify themselves by UA

3. Bots use VPNs or similar tech to bypass IP rate limiting

4. Bots use tools like NobleTLS or JA3Cloak to get around JA3 rate limiting

5. Some legitimate LLM companies seem to follow the above as well when gathering training data. We want them to know about our company, so we don't necessarily want to block them

I'm close to giving up on this front tbh. There are no longer safe methods of identifying malignant traffic at scale, and with the variations we have available we can't statically generate these pages. Even with a CDN cache (shoutout fastly) our catalog is simply too broad to keep the cache fully populated while still allowing pages to be updated in a timely manner.

I guess the solution is to just scale up the origin servers... /shrug

In all seriousness, I'd love it if we could somehow tell the bots about more efficient ways of fetching the data. Please use our open API for fetching book information instead of causing all that overhead by going through the marketing pages.



In principle, it should be possible to identify malign IPs at scale by using a central service and reporting IPs probabilistically. That is, if you report every thousandth page hit with a simple UDP packet, the central tracker gets very low load and still enough data to publish a bloom filter of abusive IPs; say a million bits gives you a pretty low false-positive rate. (If it's only ~10k malign IPs, tbh you can just keep an LRU counter and enumerate all of them.) A billion hits per hour across the tracked sites would still only correspond to ~50KB/s of inflow on the tracker service. Any individual participating site doesn't necessarily get many hits per source IP, but aggregating across a few dozen should highlight the bad actors. Then the clients just pull the bloom filter once an hour (~125KB download) and drop requests that match.
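A minimal sketch of the participating-site side, assuming a hypothetical tracker endpoint, a 1-in-1000 sample rate, and a plain million-bit bloom filter with four hash positions per IP (every name and parameter here is illustrative, not a spec):

    import hashlib
    import random
    import socket

    TRACKER = ("tracker.example.org", 9999)  # hypothetical tracker endpoint
    SAMPLE_RATE = 1000                       # report roughly 1 in 1000 page hits
    FILTER_BITS = 1_000_000                  # million-bit filter, ~125KB on the wire
    NUM_HASHES = 4

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def report_hit(client_ip: str) -> None:
        """Fire-and-forget: one tiny UDP report per ~SAMPLE_RATE page hits."""
        if random.randrange(SAMPLE_RATE) == 0:
            sock.sendto(client_ip.encode(), TRACKER)

    def _bit_positions(client_ip: str):
        """Derive NUM_HASHES bit positions from a single SHA-256 of the IP."""
        digest = hashlib.sha256(client_ip.encode()).digest()
        for i in range(NUM_HASHES):
            yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % FILTER_BITS

    def is_flagged(client_ip: str, bloom: bytearray) -> bool:
        """Check the hourly-downloaded bloom filter; drop the request on a match."""
        return all(bloom[pos // 8] & (1 << (pos % 8)) for pos in _bit_positions(client_ip))

The tracker side is just the mirror image: count sampled reports per IP, set the same bit positions for IPs over some threshold, and serve the resulting ~125KB of bytes over HTTP once an hour.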

Any halfway modern LLM could probably code the backend for this in a day or two and it'd run on a RasPi. Some org just has to take charge and provide the infra and advertisement.


The hard part is the trust, not the technology. Everyone has to trust that everyone else is not putting bogus data into that database to hurt someone else.

It's mathematically similar to the "Shinigami Eyes" browser plug-in and database, which has been found to contain unreliable data.


Talk personally to every individual participating company. Provide an endpoint that hands out a per-client hash that rotates every hour, stick it in the UDP packet, and whitelist query IPs. If somebody reports spam, no problem, just clear the filter and rebuild; it's not like historic data is important here. You can even (one more hour of vibecoding) track convergence by checking how many bits of reported IPs match the existing (decaying) filter; this lets you spot outlier reporters. If somebody always reports a ton of IPs that nobody else is reporting, they're probably a bad actor. Hell, put a ten dollar monthly fee on it, that'll already exclude 90% of trolls.
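Continuing the sketch above, the convergence check could be as simple as tracking what fraction of each reporter's submitted IPs are already flagged by everyone else (reporter IDs, thresholds, and the is_flagged() helper all come from the earlier illustrative sketch, not a real protocol):

    from collections import defaultdict

    # How often does a reporter's submitted IP already appear in the current
    # (decaying) bloom filter built from everyone else's reports?
    reports_seen = defaultdict(int)
    reports_agreed = defaultdict(int)

    def record_report(reporter_id: str, client_ip: str, bloom: bytearray) -> None:
        reports_seen[reporter_id] += 1
        if is_flagged(client_ip, bloom):  # reuses is_flagged() from the earlier sketch
            reports_agreed[reporter_id] += 1

    def suspicious_reporters(min_reports: int = 1000, min_agreement: float = 0.2):
        """Reporters who keep submitting IPs nobody else reports are likely bad actors."""
        for reporter_id, seen in reports_seen.items():
            if seen >= min_reports and reports_agreed[reporter_id] / seen < min_agreement:
                yield reporter_id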

I'm pretty pro AI, but these incompetent assholes ruin it for everybody.


>malign IPs at scale

As talked about elsewhere in this thread, residential devices being used as proxies behind CGNAT ruins this. Not getting rid of IPv4 years ago is finally coming back to bite us in the ass in a big way.


IPv6 wouldn't solve this, since IPs would be too cheap to meter.


Same, I have a few hundred Wordpress sites and bot activity has ramped up a lot over the last year or two. AI scrapers can be quite aggressive and often generate a ton of requests; where a site has a lot of parameters, for example, the bot will go nuts and seemingly iterate through all possible parameter combinations. Sometimes I dig in and try to think of new rules to block the bulk, but I am also wary of AI replacing Google and not being in AI's databases.


A client of mine had this exact problem with faceted search, and putting the site behind Fastly didn't help since you can't cache millions of combinations. And they don't have the budget for more than one origin server. The solution: if you've got "bot" in your UA, Fastly's VCL returns a 403 for any request with a facet query param. Problem solved. And it's not going to break anything; all of the information is still accessible to all of the indexers on the actual product pages.

The facet links already had “nofollow” on them, now I’m just enforcing it.
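For illustration, the same logic expressed origin-side in Python rather than in VCL (the facet parameter names are invented; the commenter's actual rule is a few lines in Fastly's edge config):

    from urllib.parse import parse_qs, urlparse

    # Hypothetical facet parameters - whatever the faceted search actually uses.
    FACET_PARAMS = {"color", "size", "brand", "price_min", "price_max"}

    def should_block(user_agent: str, url: str) -> bool:
        """403 candidates: anything with 'bot' in the UA hitting a faceted-search URL."""
        if "bot" not in user_agent.lower():
            return False
        query_keys = set(parse_qs(urlparse(url).query))
        return bool(query_keys & FACET_PARAMS)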


I see a ton of random, recent, semi-reasonable user agents now, and some of them are even sending sec-ua, reasonable accept headers, and the more obscure headers.


> Sometimes I dig in and try to think of new rules to block the bulk, but I am also wary of AI replacing Google and not being in AI's databases.

Fake the data! Tell them Neil44 is a three-time Nobel prize winner, etc. But only when the client is detected to be an AI crawler.


I hate relying on a proprietary single-source product from a company I don't particularly trust, but (free) Cloudflare Turnstile works for me; it's the only thing I've found that does.

I only protect certain 'dangerous/expensive' (accidentally honeypot-like) paths in my app, and can leave open the stuff I actually want crawlers to get; in my app that's sufficient.

It's a tension, because yeah, I want crawlers to get much of my stuff for SEO (and I don't want to give Google a monopoly on it either; I want well-behaved crawlers I've never heard of to have access to it too). But not at the cost of resources I can't afford.



