tommek4077's comments

But why?

No humans, no point.

But AI scraping doesn't remove humans...?

Even if humans make up a smaller proportion of your traffic, they're still the same number in absolute terms.


How do they get overloaded? Is the website too slow? I have quite a big wiki online and barely see any impact from bots.

A year or two ago I personally encountered scraping bots that crawled every possible page reachable from a given starting point. So if they scraped a search results page, they would also scrape every single distinct combination of facets on that search (including nonsensical combinations, e.g. products matching the filter "weight < 2 lbs AND weight > 2 lbs").

We ended up having to block entire ASNs and several subnets (lots from Facebook IPs, interestingly)
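
To give a sense of scale, here is a small sketch (the facets and their values are made up, not the actual site's): with even a handful of facets, the number of distinct filter combinations a naive crawler can reach grows multiplicatively.

    # Hypothetical facets; real catalogs often have far more values per facet.
    from itertools import product

    facets = {
        "category": ["shoes", "shirts", "hats", "bags"],
        "color":    ["red", "blue", "green", "black", "white"],
        "weight":   ["<2lbs", "2-5lbs", ">5lbs"],
        "sort":     ["price_asc", "price_desc", "newest"],
        "page":     [str(n) for n in range(1, 21)],
    }

    # To a crawler that follows every link, each combination is a distinct URL.
    urls = [
        "/search?" + "&".join(f"{k}={v}" for k, v in zip(facets, combo))
        for combo in product(*facets.values())
    ]
    print(len(urls))  # 4 * 5 * 3 * 3 * 20 = 3600 pages from a single search form

Add one more facet and the count multiplies again, which is why a crawler that ignores crawl budgets can hammer a dynamic site indefinitely.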


I have encountered this same issue with faceted search results and individual inventory listings.

If you have a lot of pages, AI bots will scrape every single one on a loop. Wikis generally don't have anywhere near as many pages as a site keyed on an auto-incremented entity primary ID. I have a few million pages on a tiny website and it gets hammered by AI bots all day long. I can handle it, but it's a nuisance, and they're basically just scraping garbage (statistics pages for historical matches, or user pages with essentially no content).

Many of them don't even self-identify; they scrape with disguised user agents or via bot farms. I've had to block entire ASNs just to tone it down. That also hurts good-faith actors who genuinely want to build on top of our APIs, because I have to block some cloud providers.

I would guess that I'm getting anywhere from 10 to 25 AI bot requests (maybe more) per real user request, and at scale that ends up being quite a lot. I route bot traffic to separate pods just so it doesn't hinder my real users' experience[0] (see the sketch below). Keep in mind that they're hitting deeply cold links, so caching doesn't do a whole lot here.

[0] this was more of a fun experiment than anything explicitly necessary, but it's proven useful in ways I didn't anticipate
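
The separate-pod routing mentioned above can be approximated with a user-agent check at the edge. A minimal sketch in Python follows; the bot token list is illustrative rather than exhaustive, the pool names are made up, and in practice you would also match on IP ranges or ASNs, since many bots spoof their user agent.

    import re

    # Illustrative crawler user-agent tokens; extend as needed.
    BOT_UA = re.compile(r"GPTBot|ClaudeBot|CCBot|Bytespider|bot|crawler|spider",
                        re.IGNORECASE)

    def pick_backend(user_agent: str) -> str:
        """Send likely bots to a separate pool so real users aren't affected."""
        if user_agent and BOT_UA.search(user_agent):
            return "bot-pool"   # cheaper pods, aggressive rate limits
        return "user-pool"      # normal serving path

    assert pick_backend("GPTBot/1.0") == "bot-pool"
    assert pick_backend("Mozilla/5.0 (X11; Linux x86_64)") == "user-pool"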


Even moderately sized wikis have a huge number of different page versions which can all be accessed individually.
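
To illustrate with made-up numbers: if every historical revision of every page gets its own URL, the crawlable surface is pages times revisions times views, not just pages.

    pages = 20_000          # articles in a moderately sized wiki (made up)
    avg_revisions = 35      # average edits per article (made up)
    extra_views = 3         # e.g. history, diff, and raw views per revision (made up)

    crawlable_urls = pages * avg_revisions * extra_views
    print(crawlable_urls)   # 2,100,000 distinct URLs from 20k articles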

How many requests per second do you get? I also see a lot of bot traffic, but nowhere near enough to hit the servers significantly, and I render most stuff on the server directly.

Around a hundred per second at peak. Even though my server can handle it just fine, it muddies up the logs and observability for something I genuinely do not care about at all. I only care about seeing real users' experience. It's just noise.

There are a lot of factors: how well your content lends itself to being cached by a CDN, the tech you (or your predecessors) chose to build it with, and how many unique pages you have. Even with pretty aggressive caching, having a couple million pages indexed adds up real fast, especially if you weren’t fortunate enough to inherit a project built on a framework that makes server-side rendering easy.

In these discussions no one will admit this, but the answer is generally yes: websites written in Python and the like.

It's not "written too slow" if you e.g. only get 50 users a week, though. If bots add so much load that you need to go optimise your website for them, then that's a bot problem not a website problem.

Yes, yes, clearly people don’t know what they’re doing, and it’s not that they’re operating at a scale, or on a problem, that you are not. MetaBrainz cannot cache all of these links, as most of them are hardly ever hit. Try to assume good intent.

But serving HTML is unbelievably cheap, isn't it?

Running 72,000 database queries to generate a bunch of random HTML pages no one has asked for in five years is not, especially compared to downloading the files designed for it.

It adds up very quickly.

The worst thing is calendar/schedule pages. Many crawlers try to load every single day, in day view, week view, and month view. Those pages are dynamically generated and virtually limitless.
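
One common mitigation is to stop rendering calendar views far outside any range a human would browse, so the URL space stops being effectively infinite. A sketch, assuming a Python web app; the two-year window and function name are arbitrary choices for illustration:

    from datetime import date, timedelta

    WINDOW = timedelta(days=730)  # arbitrary: only render views within +/- 2 years

    def calendar_view_allowed(requested: date, today: date) -> bool:
        """Reject day/week/month views only reachable by walking next/prev links forever."""
        return abs(requested - today) <= WINDOW

    print(calendar_view_allowed(date(2031, 1, 1), today=date(2026, 1, 1)))  # False: > 2 years out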

The API seems to be written in Perl: https://github.com/metabrainz/musicbrainz-server

Time for a vinyl-style Perl revival ...

Futures markets give traders leverage of 100x, sometimes more. Margin requirements are much lower than for trading spot.
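
A worked example of what 100x leverage means (illustrative numbers, not any particular exchange's rules): the margin posted is 1% of the notional position, so roughly a 1% adverse move wipes it out and triggers liquidation.

    position = 100_000               # notional size of the futures position, in dollars
    leverage = 100
    margin = position / leverage     # $1,000 posted as collateral

    adverse_move = 0.01              # a 1% move against the position
    loss = position * adverse_move   # $1,000

    print(loss >= margin)            # True: margin wiped out, position gets liquidated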


Margin requirements for trading spot are zero, though initial capital requirements are obviously, well, whatever spot is.

Futures contracts aren't just pieces of paper traded between people, they are actual promises to pay for physical delivery of the underlying.

It's not surprising to me that crypto people consider them nothing more than leveraged gambling slips but that's really not how one should think about them. Personally I think crypto needs far heavier regulation than it gets.


Ever heard of liquidations?


Yes, it’s really ‘weird’ that they refuse to share any details. Completely unlike AWS, for example. As if being open about issues with their own product wouldn’t be in their best interest. /s


What is really at risk?


Maybe the instances are shared between users via sharding or are re-used and not properly cleaned.

And maybe they contain the memory of the users and/or the documents uploaded?


And what do you expect to get? Some arbitrary, uninteresting corporate paper, a homework assignment, someone's fanfiction.

Again, what is the risk?


Presumably you’re being sarcastic, to show that those AI companies don’t give a damn about our data. Right?


Couldn't this be a first step before further escalation?


And then what? What is the risk?


I guess a sandbox escape, something, profit?


Doesn't OpenAI have a ton of data on all of its users?


And what is at risk? Someone seeing someone else's fanfiction? Or another reworded business email? Or the vacancy report of some guy in southern Germany?


This is a wild take and I’m not sure where to begin. What if I leaked your medical data, or your emails, or your browser history? What’s at risk? Your data means nothing to me.


No, it is not. Outside this strange bubble on Hacker News, no one really cares or has ever heard of the creator.

They just use WordPress.


1st rule of hacking: don't write your freaking name on it!


Plot twist: Nobody who is in charge should care.

Leave the no to the naysayers.

Ship your app, generate traffic, usage, income. Leave the discussions to other people.


Do that at $BigCorp and Legal will eat you alive, if you're not fired outright.

Long ago I went through the company-approved process to link to SQLite and they had such a long list of caveats and concerns that we just gave up. It gave me a new understanding of how much legal risk a company takes when they use a third-party library, even if it's popular and the license is not copyleft.


Unless you end up in a lawsuit that asks for a hypothetical 50% of your income for using tech very similar to theirs, where they speculate it's been stolen or isn't permitted by their license. Even if you know you are going to win, or that it doesn't affect you, you still have to spend money on lawyers to fight it.


Commenting on this to mark it in my feed for later reference. Well said!


Best is to go into the woods and live with bees.


Yes, because the world is just binary like that. You can only choose one or the other... /s


How about we invent SAI and terraform super-earths in the Milky Way to atone for our sins here?


Just copy-paste your error message and do what ChatGPT tells you.


How well has that worked for you?


For me it's usually all but useless. It "may" be this, it "may" be that, with no clue about what information it would need for a more accurate diagnosis.

