
I have been able to get them deactivated in two cities. They have not yet been physically removed but that is looking like a likely near-term outcome.

Flock has been a "side project" that's been eating about as many hours as a part-time job since late June. I have spoken at city council meetings in two cities, met individually with city councilors, met with a chief of police, presented to city councilors in Portland, am in almost daily conversations with ACLU Oregon, have received legal advice from EFF, done numerous media interviews, and I have an upcoming presentation to the state Senate Judiciary Committee. I may also be one of the reasons that Ron Wyden's office investigated Flock more carefully over the Summer and recently released a letter suggesting that cities terminate their relationship with the company.

All of which is to say I've been in it for a while now and have had some wins.

Good and bad news: it's a lot easier to fight it now than it was in June, but it's still going to take more effort than you probably imagine.

You'll need a team. I'm one member of a community working group. We have a core group of about a half-dozen active organizers. We have filed (and paid thousands in fees for) tons of public records requests, done a lot of community organizing and outreach, built partnerships with adjacent activist organizations, and done original technical research.

There are a couple of different strategies to pursue that can kick these things out of a community. My recommendation is to find the one you like best, find other people who prefer the others, and pursue them in parallel.

Depending on your local police department, you may find them to be surprisingly cooperative, or you may find that they dig in and start putting in an equal amount of effort to block yours. I've had both. Odds are that your city councilors are not aware at all of what Flock is or how it works, so your first step is to raise awareness. I strongly recommend starting with an approach that makes you seem like a reasonable, honest, and reliable member of your community.

I realize this comment isn't super helpful by itself. I'm a bit distracted at the moment and I don't think I could figure out how to write a helpful, comprehensive, and yet concise comment here on this. I need to put together an info packet for people that want to get efforts like this one started in their own community. In the meanwhile, you should be able to email contact@eyesoffeugene.org and I'm happy to provide advice and assistance to anyone that wants to take this up in their city.


Would you be open to consulting for a group that's trying to do the same in west Wyoming?

> There are a couple of different strategies to pursue that can kick these things out of a community

Would love to hear more about these, even if it's just a wall of links or brief thoughts.


> Would you be open to consulting for a group that's trying to do the same in west Wyoming?

Absolutely!

Re: Strategies

- Public records requests (aka FOIAs, though FOIA is technically for federal stuff): this has been a big one for us. File a request for the contract, a request for the locations, a request for communications, requests for the network audit, and more. PRRs take practice, but I can put you in touch with someone that's become an expert at them. Some requests may come with price tags attached and in some cases they can be expensive. Usually that means either the agency is fighting you or something in the request needs to be reworded.

- Comms: set up a site (go with something quick and easy for multiple people to use). We've had good luck setting up a community chat on Signal (now with almost 100 participants). I've spent a pile of hours just assembling different slide decks that digest lots of Flock info into smaller bits for people learning about it for the first time.

- Show up: things got rolling here when a couple of people used the public comment period at local city council meetings. Local media often monitor city council meetings, and if you're a new face and you're saying something interesting, there may be a brief interview afterward.

- Gather intelligence: we've gotten to know our local politicians pretty well. You'll want to keep some notes on where everyone stands on it, who can be moved, who prefers individual meetings, talking points they may be responsive to.

- Engage with other local activist groups. Flock is a problem that affects people with lots of different political opinions.

- Try meeting with your local police department chief and just initiate a conversation about it. They may not be as pro-Flock as you'd expect. You at least want to figure out where they stand on it and let them see, from the get-go, that you're not a direct opponent.

- Make contact with your local chapter of the ACLU. In our case, they've filed a lawsuit on our behalf over a public records request that the city refused to fulfill and the county DA denied on appeal.

- Write lots of emails to local officials, offer to meet them for coffee. They can be hard to reach initially, but once you get that initial meeting, if it goes well, they know who you are and they'll answer your texts. We are now having frequent text chats with city councilors and police commissioners and even state legislators.

This is all just off the top of my head real quick, I am probably forgetting at least one important strategy. But each of these can take a lot of time and each benefits from different skill sets, so that's where having a small group of people is really helpful.

Rather than trying to set up a hierarchical, official organization, we decided early on to just run as an ad-hoc informal "working group", and each of us would just pick up whatever tasks we were most interested in. That has worked out really well.


How do we get in touch?

https://eyesoffeugene.org/contact or contact@eyesoffeugene.org.

You're not alone. I've been on the web since around 1997, something like that. I remember it as a fun distraction, but also as a place that had recognizable handles, behind which sat a real person somewhere else in the world.

Unrestrained SEO and the failure of search engines (or, in Google's case, complicity veering towards enthusiastic support) to do anything about that was the first thing that, for me, took a lot of the fun out of the web.

Cheap botting, engagement farming, walled gardens, social media, and now AI have left me in a state of active avoidance. I don't feel good when I use the web. Like, any of it, at all.

Casual cruelty has always been a problem of online interaction, but at one time it was also balanced out by familiarity, friendliness sometimes, creativity ... but those things have gotten a lot harder to find.

The most engaging online interaction I've had recently has been some local community groups on Signal, and even that is best in small and infrequent doses.


OpenAI announced $10 billion ARR in June of this year (https://www.cnbc.com/2025/06/09/openai-hits-10-billion-in-an...).

Agriculture in California hit $61 billion in annual receipts in 2024 (https://www.cdfa.ca.gov/statistics/).

So, not that OpenAI isn't big, but, "the heart of the California economy"?

OpenAI needs to IPO, because if they don't get in on the current meme stock economy, they're going to collapse.


I was really hoping this would at last be a treatment of the most realistic risk for AI, but no.

The real risk -- and all indicators are that this is already underway -- is that OpenAI and a few others are going to position themselves to be the brokers for most of human creative output, and everyone's going to enthusiastically sign up for it.

Centralization and a maniacal focus on market capture and dominance have been the trends in business for the last few decades. Along the way they have added enormous pressures on the working classes, increasing performance expectations even as they extract even more money from employees' work product.

As it stands now, more and more tech firms are expecting developers to use AI tools -- always one of the commercial ones -- in their daily workflows. Developers who don't do this are disadvantaged in a competitive job market. Journalism, blogging, marketing, illustration -- all are competing to integrate commercial AI services into their processes.

The overwhelming volume of slop produced by all this will pollute our thinking and cripple the creative abilities of the next generation of people, all the while giving these handful of companies a percentage cut of global GDP.

I'm not even bearish on the idea of integrating AI tooling into creative processes. I think there are healthy ways to do it that will stimulate creativity and enrich both the creators and the consumers. But that's not what's happening.


The best you can get is https://deflock.me/map, which is crowd-sourced, and therefore both incomplete and inaccurate.

Cities tend to resist public records requests for camera locations.

But Flock is currently in ~5,000 communities around the country. They have managed to spread very quickly, and very quietly, and the public has only become aware of it relatively recently.

There is also a good site at https://eyesonflock.com/ that parses data from the transparency pages that some places publish.



Interesting, I was under the impression this was more common than maybe it is. I know the hosting market has gotten pretty bad.

So, I'm currently building pretty much this. After doing it on the side for clients for years, it's now my full-time effort. I have a solid and stable infrastructure, but not yet an API or web frontend. If somebody wants basically ssh, git, and static (or even not static!) hosting that comes with a sysadmin's contact information for a small number of dollars per month, I can be reached at sysop@biphrost.net.

Environment is currently Debian-in-LXC-on-Debian-on-DigitalOcean.


People outside of a really small sysadmin niche just don't grasp the scale of this problem.

I run a small-but-growing boutique hosting infrastructure for agency clients. The AI bot crawler problem recently got severe enough that I couldn't just ignore it anymore.

I'm stuck between, on one end, crawlers from companies that absolutely have the engineering talent and resources to do things right but still aren't, and on the other end, resource-heavy WordPress installations where the client was told it was a build-it-and-forget-it kind of thing. I can't police their robots.txt files; meanwhile, each page load can take a full 1s round trip (most of that spent in MySQL), there are about 6 different pretty aggressive AI bots, and occasionally they'll get stuck on some site's product variants or categories pages and start hitting it at a 1r/s rate.

There's an invisible caching layer that does a pretty nice job with images and the like, so it's not really a bandwidth problem. The bots aren't even requesting images and other page resources very often; they're just doing tons and tons of page requests, and each of those is tying up a DB somewhere.

Cumulatively, it is close to having a site get Slashdotted every single day.

I finally started filtering out most bot and crawler traffic at nginx, before it gets passed off to a WP container. I spent a fair bit of time sampling traffic from logs, and at a rough guess, I'd say maybe 5% of web traffic is currently coming from actual humans. It's insane.
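
For a sense of what "filtering at nginx" can look like, here's a stripped-down sketch of the approach. The user-agent list and the wp_backend upstream name are placeholders, not my actual config, and a real deployment needs the list maintained continuously:

    # Flag common AI crawler user agents (illustrative list, not exhaustive)
    map $http_user_agent $is_ai_bot {
        default              0;
        ~*GPTBot             1;
        ~*ClaudeBot          1;
        ~*CCBot              1;
        ~*Bytespider         1;
        ~*Amazonbot          1;
        ~*meta-externalagent 1;
    }

    server {
        listen 80;
        server_name example.com;

        location / {
            # Refuse flagged bots before the request ever reaches the WP container
            if ($is_ai_bot) {
                return 403;
            }
            proxy_pass http://wp_backend;  # placeholder upstream for the WP container
        }
    }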

I've just wrapped up the first round of work for this problem, but that's just buying a little time. Now, I've gotta put together an IP intelligence system, because clearly these companies aren't gonna take "403" for an answer.


I might write a blog post on this, but I seriously believe we collectively need to rethink The Cathedral and the Bazaar.

The Cathedral won. Full stop. Everyone, more or less, is just a stonecutter, competing to sell the best stone (i.e. content, libraries, source code, tooling) for building the cathedrals with. If the world is a farmer's market, we're shocked that the farmer's market is not defeating Walmart, and never will.

People want Cathedrals; not Bazaars. Being a Bazaar vendor is a race to the bottom. This is not the Cathedral exploiting a "tragedy of the commons," it's intrinsic to decentralization as a whole. The Bazaar feeds the Cathedral, just as the farmers feed Walmart, just as independent websites feed Claude, a food chain and not an aberration.


The Cathedral and the Bazaar meets The Tragedy of the Commons.

Let's say there's two competing options in some market. One option is fully commercialized, the other option holds to open-source ideals (whatever those are).

The commercial option attracts investors, because investors like money. The money attracts engineers, because at some point "hacker" came to mean "comfortable lifestyle in a high COL area". The commercial option gets all the resources, it gets a marketing team, and it captures 75% of the market because most people will happily pay a few dollars for something they don't have to understand.

The open source option attracts a few enthusiasts (maybe; or, often, just one), who labor at it in whatever spare time they can scrape together. Because it's free, other commercial entities use and rely on the open source thing, as long as it continues to be maintained in something that, if you squint, resembles slave labor. The open source option is always a bit harder to use, with fewer features, but it appeals to the 25% of the market that cares about things like privacy or ownership or self-determination.

So, one conclusion is "people want Cathedrals", but another conclusion could be that all of our society's incentives are aligned towards Cathedrals.

It would be insane, after all, to not pursue wealth just because of some personal ideals.


This is pretty much a more eloquent version of what I was about to write. It's dangerous to take a completely results oriented view of a situation where the commercial incentives are so absurdly lopsided. The cathedral owners spend more than the GDP of most countries every year on various carrots and sticks to maintain something like the current ecosystem. I think the current world is far from ideal for most people, but it's hard to compete against the coordinated efforts of the richest and most powerful entities in the world.


The answer is quite simply that where complexity exceeds the regular person's interest, there will be a cathedral.

It's not about capitalism or incentives. Humans have cognitive limits and technology is very low on the list for most. They want someone else to handle complexity so they can focus on their lives. Medieval guilds, religious hierarchies, tribal councils, your distribution's package repository, it's all cathedrals. Humans have always delegated complexity to trusted authorities.

The 25% who 'care about privacy or ownership' mostly just say they care. When actually faced with configuring their own email server or compiling their own kernel, 24% of that 25% immediately choose the cathedral. You know the type, the people who attend FOSDEM carrying MacBooks. The incentives don't create the demand for cathedrals, but respond to it. Even in a post-scarcity commune, someone would emerge to handle the complex stuff while everyone else gratefully lets them.

The bazaar doesn't lose because of capitalism. It loses because most humans, given the choice between understanding something complex or trusting someone else to handle it, will choose trust every time. Not just trust, but CYA (I'm not responsible for something I don't fully understand) every time. Why do you think AI is successful? I'd rather even trust a blathering robot than myself. It turns out, people like being told what to do on things they don't care about.


> The Bazaar feeds the Cathedral

Isn't this the licensing problem? Berkeley releases BSD so that everyone can use it, people do years of work to make it passable, Apple takes it to make macOS and iOS because the license allows them to, and then they have both the community's work and their own work, so everyone uses that.

The Linux kernel is GPLv2, not GPLv3, so vendors distribute binary blob drivers/firmware with their hardware and then the hardware becomes unusable as soon as they stop publishing new versions because then to use the hardware you're stuck with an old kernel with known security vulnerabilities, or they lock the boot loader because v2 lacks the anti-Tivoization clause in v3.

If you use a license that lets the cathedral close off the community's work then you lose, but what if you don't do that?


Couldn't it be addressed in front of the application with a fail2ban rule, some kind of 429 Too Many Requests quota on a per session basis? Or are the crawlers anonymizing themselves / coming from different IP addresses?


Yeah, that's where IP intelligence comes in. They're using pretty big IP pools, so, either you're manually adding individual IPs to a list all day (and updating that list as ASNs get continuously shuffled around), or you've got a process in the background that essentially does whois lookups (and caches them, so you aren't also being abusive), parses the metadata returned, and decides whether that request is "okay" or not.

The classic 80/20 rule applies. You can catch about 80% of lazy crawler activity pretty easily with something like this, but the remaining 20% will require a lot more effort. You start encountering edge cases, like crawlers that use AWS for their crawling activity, but also one of your customers somewhere is syncing their WooCommerce orders to their in-house ERP system via a process that also runs on AWS.
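
To make that "80%" concrete, the first pass can be as dumb as a geo block in nginx that flags known datacenter ranges and carves out exceptions for legitimate automated clients. The ranges and the allowlisted host below are made up for illustration (they're documentation/TEST-NET prefixes); the real list is whatever falls out of those whois/ASN lookups, and it changes constantly:

    # Illustrative only: classify requests by source network.
    geo $from_datacenter {
        default           0;
        203.0.113.0/24    1;  # pretend this is a cloud provider's range
        198.51.100.0/24   1;  # another placeholder "datacenter" range
        203.0.113.25/32   0;  # hypothetical customer's ERP sync host, explicitly allowed
    }

    server {
        listen 80;
        server_name example.com;

        location / {
            # Longest prefix wins, so the /32 exception overrides the /24 block
            if ($from_datacenter) {
                return 403;
            }
            proxy_pass http://wp_backend;  # placeholder upstream
        }
    }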


I've had crawlers get stuck in a loop before on a search page where you basically could just keep adding things, even if there are no results. I filtered requests that are bots for sure (requests refined long past the point of any results). It was over a million unique IPs, most of which were only doing 1 or 2 requests on their own (from many different IP blocks).


They are spreading themselves across lots of different IP blocks.


It's called Anubis.


Isn't that the one that shows anime characters? Or is Anubis the "professional" version that doesn't show anime chars?


Yes, that's Anubis. And yes, you pay to not show the anime cat girl.


That's genius.


Honestly the more Anubis' anime mascot annoys people the more I like it.


The point of this is to make things difficult for bots, not to annoy visitors of the site. I respect that it is the dev's choice to do what they want with the software they create and make available for free. Anime is a polarizing format for reasons beyond the scope of this discussion. It definitely says a lot about the dev.


Anime is only "polarizing" for an extreme subset of people. Most people won't care. No one should care, it's just a cute mascot image.


It says a lot more about the pearl clutching of the people complaining about it than it does the dev.


Anubis blocks all phones with odd processor counts (many Pixel phones, for example).


There are some ASN-based DROP list collections on GitHub if that would help.


Oh! That didn't even occur to me. Yeah, I could pump that into ipset. Got one in particular that you think is reliable?


I think Spamhaus runs the big one.


This is probably a dumb question, but at what point do we put a simple CAPTCHA in front of every new user that arrives at a site, then give them a cookie and start tracking requests per second from that user?

I guess it's a kind of soft login required for every session?

update: you could bake it into the cookie approval dialog (joke!)
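
update 2: the rate-tracking half of this is easy enough to sketch in nginx; the cookie name and paths below are made up, and it obviously doesn't deal with bots that solve CAPTCHAs or replay cookies:

    # Track request rate per session cookie (hypothetical cookie name)
    limit_req_zone $cookie_session_id zone=per_session:10m rate=2r/s;

    server {
        listen 80;
        server_name example.com;

        location / {
            # No session cookie yet: send them to a (hypothetical) challenge
            # page that sets one after a CAPTCHA.
            if ($cookie_session_id = "") {
                return 302 /challenge;
            }
            limit_req zone=per_session burst=10 nodelay;
            proxy_pass http://backend;  # placeholder upstream
        }
    }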


The post-AI web is already a huge mess. I'd prefer solutions that don't make it worse.

I myself browse with cookies off, sort of, most of the time, and the number of times per day that I have to click a Cloudflare checkbox or help Google classify objects from its datasets is nuts.


> The post-AI web is already a huge mess.

You mean the peri-AI web? Or is AI already done and over and no longer exerting an influence?


> meanwhile, each page load can take a full 1s round trip (most of that spent in MySQL)

Can't these responses still be cached by a reverse proxy as long as the user isn't logged in, which the bots presumably aren't?
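
Something like the following is what I have in mind, just as a sketch: cache anonymous page views at the proxy and skip the cache whenever one of the standard WordPress login/comment cookies is present. Cache sizing, TTLs, and the upstream name are placeholders:

    proxy_cache_path /var/cache/nginx/wp levels=1:2 keys_zone=wpcache:50m
                     max_size=1g inactive=10m;

    # Skip the cache for logged-in users, password-protected posts, and commenters
    map $http_cookie $skip_cache {
        default                0;
        ~*wordpress_logged_in  1;
        ~*wp-postpass          1;
        ~*comment_author       1;
    }

    server {
        listen 80;
        server_name example.com;

        location / {
            proxy_cache        wpcache;
            proxy_cache_valid  200 10m;
            proxy_cache_bypass $skip_cache;
            proxy_no_cache     $skip_cache;
            proxy_pass         http://wp_backend;  # placeholder upstream
        }
    }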


They're presumably not crawling the same page repeatedly, and caching the pages long enough to persist between crawls would require careful thinking and consultation with clients (e.g., if they want their blog posts to show up quickly, or an "on sale" banner, etc.).

It'd probably be easier to come at it from the other side and throw more resources at the DB or clean it up. I can't imagine what's going on that it's spending a full second on DB queries, but I also don't really use WP.


It's been a few years since I last worked with WP, but the performance issue is because they store a ton of the data in a key-value store instead of tables with fixed columns.

This can result in a ton of individual row hits on your database for what in any normal system is a single 0.1ms (often faster) DB request.

Any web scraper that is scraping SEQUENTIALLY at 1r/s is actually a well-behaved and non-intrusive scraper. It's just that WP is in general ** for performance.

If you want to see what a bad scraper does with parallel requests with few limits, yeah, WP is going down without putting up any struggle. But everybody wanted to use WP, and now those ducks are coming home to roost when there is a bit more pressure.


Is that WP Core or a result of plugins? Only if you know offhand; I don't need to know badly enough for it to be worth digging in.

> Any web scraper that is scraping SEQUENTIALLY at 1r/s is actually a well-behaved and non-intrusive scraper.

I think there's still room for improvement there, but I get what you mean. I think an "ideal" bot would base its QPS on response time and back off if it goes up, but it's also not unreasonable to say "any website should be able to handle 1 QPS without flopping over".

> It's just that WP is in general ** for performance.

WP gets a lot of hate, and much of it is deserved, but I genuinely don't think I could do much better under the constraint of supporting an often non-technical userbase with a plugin system that can do basically arbitrary things, written by developers of varying quality.

> But everybody wanted to use WP, and now those ducks are coming home to roost when there is a bit more pressure.

This is actually an interesting question, I do wonder if WP users are over-represented in these complaints and if there's a potential solution there. If AI scrapers can be detected, you can serve them content that's cached for much longer because I doubt either party cares for temporally-sensitive content (like flash sales).


> Is that WP Core or a result of plugins?

A combination of all of them ... Take into account that it's been 8 years since I last worked in PHP and WordPress, so maybe things have improved, but I doubt it, as some issues are structural.

* PHP is a fire-and-forget programming language. So whenever you do a request, there is no persistence of data (unless you offload to an external cache server). This results in a total rerendering by the PHP code on every request.

* Then we have WP core, which is not exactly shy in its calls to the DB. The way they store data in a key/value system really hurts the performance. Remember what I said above about PHP ... so if you have a design that is heavy, your language needs to redo all those calls every time.

* Followed by ... extensions that are, let's just say, not always optimally written. The plugins are often the main reason you see so many leaked databases on the internet.

The issue with WP is that its design is like 25 years old. It gained most of its popularity because it was free and you were able to extend it with plugins. But it's that same plugin system that made it harder for the WP developers to really tackle the performance issues, as breaking a ton of plugins often results in losing market share.

The main reason WP has survived the increased web traffic has been that PHP has increased in performance by a factor of 3x over the years, combined with server hardware itself getting faster and faster. It also helped that cache plugins exist for WP.

But now, as you have noticed, when you have a ton of passive or aggressive scrapers hitting WP websites, the cache plugins that have been the main protection layer keeping WP sites functional can not handle it. Scrapers hit every page, even pages that are non-popular/archived/... and normally never get cached. Because you're getting hit on those non-popular pages, this then shows the fundamental weakness of WP.

The only way you can even slightly deal with this type of behavior (beyond just blocking scrapers) is by increasing your database memory limits by a ton, so you're not doing constant swapping. Increase the caching of pages in your actual WP cache extensions, so more is held in memory. You're probably also looking at increasing the number of PHP instances your server can load, more DB ...

But that assumes you have control over your WP hosting environment. And the companies that often host 100,000 or millions of sites are not exactly motivated to throw tons of money at the problem. They prefer that you "upgrade" to more expensive packages that will only partially mitigate the issue.

In general, everybody is f___ed ... The amount of data scraping is only going to get worse.

Especially now that LLMs have tool usage, as in, they can search the internet for information themselves. This is going to result in tens of millions of requests from LLMs. Somebody searching for cookie recipes may result in dozens of page hits in a second, where a normal user in the past first did a Google search (hitting Google's cache), and only then opened a page ... not what they want, go back, somewhere else. What may have been 10 requests over multiple sites, over a 5-10 minute time frame, is now going to be dozens of parallel requests per second.

LLMs are great search engines, but as the tech moves more to consumer-level hardware, you're going to see this only getting worse.

The solution is a fundamental rework of a lot of websites. One of the main reasons I switched out of PHP years ago, and eventually settled on Go, was that even at that time we were hitting limits already. It's one of the reasons that Facebook made Hack (PHP with persistence and other optimizations). Rendering complete pages on every request is just giving away performance; not being able to cache data internally ... you get the point.

> This is actually an interesting question, I do wonder if WP users are over-represented in these complaints and if there's a potential solution there. If AI scrapers can be detected, you can serve them content that's cached for much longer because I doubt either party cares for temporally-sensitive content (like flash sales).

The issue is not cached content; it's that they go for all the data in your database. They do not care if your articles are from 1999.

The only way you can solve this issue is by having API endpoints for every website, where scrapers can feed on your database data directly (so you avoid needing to render complete pages), AND where they can feed on /api/articles/latest-changed or something like that.

And that assumes this is standardized across the industry. Because if it's not, it's just easier for scrapers to go after all the pages.

FYI: I wrote my own scraper in Go, on a dual-core VPS that costs 3 euros a month, which can do 10,000 scrapes per second (we are talking direct scrapes, not going through a browser to deal with JS detection).

Now, do you want to guess the resource usage on your WP server if I let it run wild ;) You're probably going to spend 10 to 50x more money just to feed my scraper without me taking your website down.

Now, do I do 10,000 requests per second? No ... because 1r/s per website is still 86,400 page hits per day. And because I combined this with actually looking up pages that had "latest xxxx" and caching that content, I knew that I only needed to scrape X new pages every 24h. So it took me a month or three for some big website scrapes, and later you do not even see me, as I am only doing page updates.

But that takes work! You need to design this for every website, and some websites do not have any good spot you can hook into for a low-resource "is there something new" check.

And I have not even talked about websites that actively try to make scraping difficult (like constantly changing tags, dynamic HTML blocks on renders, JS blocking, captcha forcing), which, ironically, hurts them more, as this can result in full rescrapes of their sites.

So, ironically, the easiest solution for less scrupulous scrapers is to simply throw resources at the issue. Why bother with "is there something new" effort on every website when you can just rescrape every page link you find using a dumb scraper, compare that with your local cache checksum, and then update your scraped page result? And then you get those over-aggressive scrapers that DDoS websites. Combine that with half of the internet being WP websites +lol+

The amount of resources needed to scrape is so small, and the more you try to prevent scrapers, the more you're going to hinder your own customers / legit users.

And again, this is just me scraping some novel/manga websites for my own private usage / datahoarding. The big boys have access to complete IP blocks, can resort to using home IPs (as some sites detect whether you're coming from a datacenter-leased IP or a home ISP IP), and have way more resources available to them.

This has been way too long, but the only way to win against scrapers is a standardized way to do legit scraping. Ironically, we used to have this with RSS feeds years ago, but everybody gave up on them. When you have an easier endpoint for scrapers, a lot of them have less incentive to just scrape your every page. Will there be bad guys? Yep, but it then becomes easier to just target them until they also comply.

But the internet will need to change into something new for it to survive the new era ... and I think standardized API endpoints will be that change. Or everybody needs to go behind login pages, but yeah, good luck with that, because even those are very easy to bypass with account-creation solutions.

Yeah, everybody is going to be f___ed, because forget about making money with advertising for the small website. The revenue model is also going to change. We already see this with Reddit selling their data directly to Google.

And this has been way too much text.


> The way they store data in a key/value system really hurts the performance

It doesn't, unless your site has a lot of post/product/whatever entries in the db and you are having your users search among them with multiple criteria at the same time. Only then does it cause many self-joins to happen and create performance concerns. Otherwise the key-value setup is very fast when it comes to just pulling key+value pairs for a given post/content.

Today WordPress can easily do 50 req/sec cached (locally) on $5/month hosting with PHP 8+. It can easily do 10 req/sec uncached for logged-in users, with absolutely no form of caching (though you would generally use an object cache, pushing it much higher).

White House is on Wordpress. NASA is on Wordpress. Techcrunch, CNN, Reuters and a lot more.


Just want to point out that your 50 req/sec cached means nothing when dealing with scrapers, which is the entire topic ...

The issue is that scrapers hit so many pages that you can never cache everything.

If your website is a 5-page blog with no built-up archive of past posts, sure ... scrapers are not going to hurt, because they keep hitting the cached pages and resetting the invalidation.

But for everybody else, getting hit on uncached pages results in heavy DB loads and kills your performance.

Scrapers do not care about your top (cached) pages, especially aggressive ones that just rescrape non-stop.

> It doesn't, unless your site has a lot of post/product/whatever entries in the db

Exactly what is being hit by scrapers...

> White House is on Wordpress. NASA is on Wordpress. Techcrunch, CNN, Reuters and a lot more.

Again, not the point. They can throw resources at the problem and cache tons of data with 512GB/1TB WordPress/DB servers. That turns WP into a mostly static site.

It's everybody else that feels the burn (see the article, see the previous poster, and others).

Do you understand the issue now? WP is not equipped to deal with this type of traffic, as it's not normal human traffic. WP is not designed to handle this; it barely handles normal traffic without throwing a lot of resources at it.

There is a reason the Reddit/Slashdot effect exists. Just a few thousand people going to a blog tends to make a lot of WP websites unresponsive. And that is with the ability to cache those pages!

Now imagine somebody like me letting a scraper loose on your WP website. I can scrape 10,000 pages/sec on a 4-bucks VPS. But each page I hit that is not in your cache will make your DB scream even more, because of how WP works. So what are you going to do with your 50 req/s cached, when my next 9,950 req/s hit all your non-cached pages?! You get the point?

And FYI: 10,000 req/s on your cached pages will also make your WP install unresponsive. Scraper resource usage vs WP is a fight nobody wins.


That would be nice! This doesn't work reliably enough for WP sites. Whether it's devs making changes and testing them in prod, or dynamic content loaded at identical URLs, my past attempts to cache HTML have caused questions and complaints. The current caching strategy hits a nice balance and hasn't bothered anyone, with the significant downside that it's vulnerable to bot traffic.

(If you choose to read this as, "WordPress is awful, don't use WordPress", I won't argue with you.)


The idea of "a fall" is a universal bug in our thinking, I believe.

Most of what we learn has been from dramatization. There is, maybe, slightly less now, offset by more technical text, but that's a really recent change in human history. And even then, how many people's ideas of history are entirely formed from television shows and movies, or even stories told to each other?

We talk about what makes stories compelling. Compelling stories have a beginning, a climax, and an end. History thus also has a beginning, climax, and end.

So we think of the Roman Empire as a thing in history that had a beginning, and a rise, and then a fall.

Yet Rome still exists.

I see this a lot in discussions about businesses. Somebody will do something dumb, and then immediately two camps form: those that throw darts at the exact date that the business will cease to exist, and those that mock the first camp's predictions of timely demise.

So we get these really repetitive, entirely pointless debates after the fact about whether the business is "dead" or not, so everybody can try to figure out which camp was right.

But, in the general case, it never works that way. For every WebVan, there are a hundred Reddits: persistently spectral businesses, online and still making money, occasionally bolstered by some CEO whose job it is to convince everyone that the business is still as full of verve as ever it was, and yet, when people think of it, they think of it nostalgically, if they do at all.

Slashdot is still online and posting stories every day.

Rome never fell; its story changed. It stopped being the main character of historical storytelling for a certain period, but of course that leaves us wondering what happened to that character, and when exactly did that character die?


> Yet Rome still exists.

A city exists in the same place, yes, but you'd be insane to fail to note the centuries of economic decline and the nearly complete disappearance of the bureaucratic state.


> idea of "a fall" is a universal bug in our thinking

Occasionally it did happen that way, though. E.g., the Soviet Union was seemingly fine (or not much worse than before), and then suddenly it and its empire were gone within a couple of years.

Their style of communism/socialism went down with them and is pretty much dead.


Ironically this is a pretty good point in favor of GP, since the “powers that were” in Soviet Russia retained a great deal of their power and connections, as sibling comment noted. That said, I don’t think there is even a remote resemblance between the imperial might of Old Rome and its remnants in Italia. What did and does remain are the lineage of Romanized nations that succeeded it in Western Europe, and as many others have noted, the Church, which is a whole other subject in itself.


> between the imperial might of Old Rome and its remnants in Italia

After more than a thousand years that's true.

But Rome remained the "universal [Christian] Empire" well into the middle ages. In a way didn't really matter if its capital was in Constantinople or Aachen or Rome. Or if its ruler spoke Latin, Greek or even German.

It's ~35 years since the collapse of the USSR, and its cultural identity has pretty much disappeared.


I would consider Old Rome and the Holy Roman Empire to be two distinct entities.


Perhaps, yet a reasonably educated person in the Frankish Empire or the early Holy Roman Empire probably wouldn't. It wouldn't even make much sense to them.


> Their style of communism/socialism went down with them and is pretty much dead.

Arguably the absolute worst aspects of it are preserved under the state that exists there today. There's certainly a solid through-line through Putin himself.


Is it? Putin's regime seems like a mostly generic pseudo-fascist dictatorship, except that they have nukes and stuff which they inherited from the USSR.

> absolute worst aspects of it are preserved

While it's obviously horrible, let's not downplay what the USSR was.

Even as late as the '80s, they were pretty much exterminating entire towns and villages in Afghanistan. Of course there are tactical and strategic reasons why they can't do something close in scale to that in Ukraine.


I mean, can I not just spend the money to buy a better society in which to live?

Museums. I love museums. They all need more support. Kids need more places to do field trips.

Libraries ... they are experiencing budget cuts everywhere now as cities prioritize police spending.

Parks.

Homes for people that can't afford them. Seriously, one of the most effective possible cures for homelessness is to set up a program that helps people cover their rent for a month or two if they get into trouble.

Health care. Like, there's got to be a pile of people that need urgent health care and can't afford it, right?

Education. Adult education, too.

Science and research.

And most, maybe all of these, aren't even things that necessarily need an entirely new organization to spearhead them, or some kind of dramatic social change. They are all things that exist right now and need more funding than anything else. You could hire a small team to just look up all kinds of programs all day long and write checks for them and it would be enormously impactful.

I just... the answer to this seems so blindingly obvious to me, and then I read the rest of the comments, and I really wonder when exactly the hacker ethos got co-opted by the crab mentality.


Good to see this. For those that weren't aware, there's been a low-effort solution with https://github.com/dehydrated-io/dehydrated, combined with a pretty simple couple of lines in your vhost config:

    location ^~ /.well-known/acme-challenge/ {
        alias <path-to-your-acme-challenge-directory>;
    }
Dehydrated has been around for a while and is a great low-overhead option for http-01 renewal automation.
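
For completeness, the dehydrated side of that setup looks roughly like the following, from memory; paths and domains here are placeholders, and the repo's example config documents the real options:

    # /etc/dehydrated/domains.txt -- one line per certificate, SANs after the first name
    example.com www.example.com

    # /etc/dehydrated/config -- must point at the directory the nginx alias serves
    WELLKNOWN="/var/www/dehydrated"

    # One-time setup: dehydrated --register --accept-terms
    # cron -- sign/renew anything close to expiry, then reload nginx
    0 3 * * * /usr/local/bin/dehydrated --cron && systemctl reload nginx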


The same config also works with certbot. I've used it for years.


This is really cool, but I find it really distasteful when projects that have thousands of people depending on them don't cut a stable release.

Edit: Downvote me all you want, that's reality folks, if you don't release v1.0.0, the interface you consume can change without you realizing it.

Don't consume major version 0 software, it'll bite you one day. Convince your maintainers to release stable cuts if they've been sitting on major version 0 for years. It's just lazy and immature practice abusing semantic versioning. Maintainers can learn and grow. It's normal.

Dehydrated has been major version 0 for 7 years, it's probably past due.

See also React, LÖVE, and others that made 0.n.x jumps to n.x.x. (https://0ver.org)

CalVer: "If both you and someone you don't know use your project seriously, then use a serious version."

SemVer: "If your software is being used in production, it should probably already be 1.0.0."

https://0ver.org/about.html


Distasteful by whom, the people depending on it? Surely not… the people providing free software at no charge, as is? Surely not…

Maybe not distasteful by any one in particular, but just distasteful by fate or as an indicator of misaligned incentives or something?


> Distasteful by whom, the people depending on it? Surely not…

Why not?


That's the great thing about open source. If you are not satisfied with the free labour's pace of implementing a feature you want, you can do it yourself!


Yes, absolutely! I would probably just pick a version to fork, set it to v1.0.0 for your org's production path, and then you'd know the behavior would never change.

You could then merge updates back from upstream.


It's generally easier to just deal with breaking changes, since writing code is faster than gaining understanding and breaking changes in the external api are generally much better documented than internals.


FWIW I have been using and relying on Dehydrated to handle LetsEncrypt automation for something like 10 years, at least. I think there was one production-breaking change in that time, and to the best of my recollection, it wasn't a Dehydrated-specific issue, it was a change to the ACME protocol. I remember the resolution for that being super easy, just a matter of updating the Dehydrated client and touching a config file.

It has been one of the most reliable parts of my infrastructure and I have to think about it so rarely that I had to go dig the link out of my automation repository.


You've been using Dehydrated since its initial commit in December of 2015?


I am pretty sure that this is the thread that introduced me to it: https://news.ycombinator.com/item?id=10681851

Unfortunately, web.archive.org didn't grab an https version of my main site from around that period. My oldest server build script in my current collection does have the following note in it:

    **Get the current version of dehydrated from https://github.com/dehydrated-io/dehydrated **
    (Dehydrated was previously found at https://github.com/lukas2511/dehydrated)
...so I was using it back when it was under the lukas2511 account. Those tech notes however were rescued from a long-dead Phabricator installation, so I no longer have the change history for them, unless I go back and try to resurrect its database, which I think I do still have kicking around in one of my cold storage drives...

But yeah, circa 2015 - 2016 should be about right. I had been hosting stuff for clients since... phew, 2009? So LetsEncrypt was something I wanted to adopt pretty early, because back then certificate renewals were kind of annoying and often not free, but I also didn't want to load whatever the popular ACME client was at the time. Then this post popped up, and it was exactly what I had been looking for, and would have started using it soon after.

edit: my Linode account has been continuously active since October 2009, though it only has a few small legacy services on it now. I started that account specifically for hosting mail and web services for clients I had at the time. So, yeah, my memory seems accurate enough.


Feel free to provide and support a "stable" branch/fork that meets your standards.

Be the change you want to see!

Edit to comment on the edit:

> Edit: Downvote me all you want

I don't generally downvote, but if I were going to I would not need your permission :)

> that's reality folks, if you don't release v1.0.0, the interface you consume can change without you realizing it.

I assume you meant "present" there rather than "consume"?

Anyway, 1.0.0 is just a number. Without relevant promises and a track record and/or contract to back them up, breaking changes are as likely there as with any other number. A "version 0.x.x" of a well used and scrutinized open source project is more reliable and trustworthy than something that has just had a 1.0.0 sticker slapped on it.

Edit after more parent edits: or go with one of the other many versioning schemes. Maybe ItIsFunToWindUpEntitledDicksVer Which says "stick with 0.x for eternity, go on, you know you want to!".


Another person who thinks semver is some kind of eldritch law-magic, serving well to illustrate the primary way in which semver was and is a mistake.

Sacrificing a version number segment as a permanent zero prefix to keep them away is the most practical way to appease semver's fans, given that they exist in numbers and make ill-conceived attempts to depend on semver's purported eldritch law-magics in tooling. It's a bit like the "Mozilla" in browser user-agents; I hope we can stop at one digit sacrificed, rather than ending up like user-agents did, though.

In other words, 0ver, unironically. Pray we do not need 0.0ver.

