tommek4077's comments

But why?

No humans, no point.

But AI scraping doesn't remove humans...?

Even if humans make up a smaller proportion of your traffic, they're still the same number in absolute terms.


How do they get overloaded? Is the website too slow? I have quite a big wiki online and barely see any impact from bots.

A year or two ago I personally encountered scraping bots that crawled every possible page reachable from a given starting point. So if they scraped a search results page, they would also scrape every single distinct combination of facets on that search (including nonsensical combinations, e.g. products matching the filter "weight < 2 lbs AND weight > 2 lbs").

We ended up having to block entire ASNs and several subnets (lots from Facebook IPs, interestingly)
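
To give a sense of scale, here is a small sketch (the facets and their values are made up, not the actual site's): with even a handful of facets, the number of distinct filter combinations a naive crawler can reach grows multiplicatively.

    # Hypothetical facets; real catalogs often have far more values per facet.
    from itertools import product

    facets = {
        "category": ["shoes", "shirts", "hats", "bags"],
        "color":    ["red", "blue", "green", "black", "white"],
        "weight":   ["<2lbs", "2-5lbs", ">5lbs"],
        "sort":     ["price_asc", "price_desc", "newest"],
        "page":     [str(n) for n in range(1, 21)],
    }

    # To a crawler that follows every link, each combination is a distinct URL.
    urls = [
        "/search?" + "&".join(f"{k}={v}" for k, v in zip(facets, combo))
        for combo in product(*facets.values())
    ]
    print(len(urls))  # 4 * 5 * 3 * 3 * 20 = 3600 pages from a single search form

Add one more facet and the count multiplies again, which is why a crawler that ignores crawl budgets can hammer a dynamic site indefinitely.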


I have encountered this same issue with faceted search results and individual inventory listings.

If you have a lot of pages, AI bots will scrape every single one on a loop. Wikis generally don't have anywhere near as many pages as a site keyed on an auto-incremented entity primary ID. I have a few million pages on a tiny website and it gets hammered by AI bots all day long. I can handle it, but it's a nuisance, and they're basically just scraping garbage (statistics pages for historical matches, or user pages with essentially no content).

Many of them don't even self-identify; they scrape with disguised user agents or via bot farms. I've had to block entire ASNs just to tone it down. That also hurts good-faith actors who genuinely want to build on top of our APIs, because I have to block some cloud providers.

I would guess that I'm getting anywhere from 10 to 25 AI bot requests (maybe more) per real user request, and at scale that ends up being quite a lot. I route bot traffic to separate pods just so it doesn't hinder my real users' experience[0] (see the sketch below). Keep in mind that they're hitting deeply cold links, so caching doesn't do a whole lot here.

[0] this was more of a fun experiment than anything explicitly necessary, but it's proven useful in ways I didn't anticipate
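
The separate-pod routing mentioned above can be approximated with a user-agent check at the edge. A minimal sketch in Python follows; the bot token list is illustrative rather than exhaustive, the pool names are made up, and in practice you would also match on IP ranges or ASNs, since many bots spoof their user agent.

    import re

    # Illustrative crawler user-agent tokens; extend as needed.
    BOT_UA = re.compile(r"GPTBot|ClaudeBot|CCBot|Bytespider|bot|crawler|spider",
                        re.IGNORECASE)

    def pick_backend(user_agent: str) -> str:
        """Send likely bots to a separate pool so real users aren't affected."""
        if user_agent and BOT_UA.search(user_agent):
            return "bot-pool"   # cheaper pods, aggressive rate limits
        return "user-pool"      # normal serving path

    assert pick_backend("GPTBot/1.0") == "bot-pool"
    assert pick_backend("Mozilla/5.0 (X11; Linux x86_64)") == "user-pool"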


Even moderately sized wikis have a huge number of different page versions which can all be accessed individually.
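
To illustrate with made-up numbers: if every historical revision of every page gets its own URL, the crawlable surface is pages times revisions times views, not just pages.

    pages = 20_000          # articles in a moderately sized wiki (made up)
    avg_revisions = 35      # average edits per article (made up)
    extra_views = 3         # e.g. history, diff, and raw views per revision (made up)

    crawlable_urls = pages * avg_revisions * extra_views
    print(crawlable_urls)   # 2,100,000 distinct URLs from 20k articles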

How many requests per second do you get? I also see a lot of bot traffic, but nowhere near enough to hit the servers significantly, and I render most stuff on the server directly.

Around a hundred per second at peak. Even though my server can handle it just fine, it muddies up the logs and observability for something I genuinely do not care about at all. I only care about seeing real users' experience. It's just noise.

There are a lot of factors: how well your content lends itself to being cached by a CDN, the tech you (or your predecessors) chose to build it with, and how many unique pages you have. Even with pretty aggressive caching, having a couple million pages indexed adds up real fast, especially if you weren’t fortunate enough to inherit a project built on a framework that makes server-side rendering easy.

In these discussions no one will admit this, but the answer is generally yes: websites written in Python and the like.

It's not "written too slow" if you e.g. only get 50 users a week, though. If bots add so much load that you need to go optimise your website for them, then that's a bot problem not a website problem.

Yes, yes, clearly people don’t know what they’re doing, and it’s not that they’re operating at a scale, or on a problem, that you are not. MetaBrainz cannot cache all of these links, as most of them are hardly ever hit. Try to assume good intent.

But serving HTML is unbelievably cheap, isn't it?

Running 72,000 database queries to generate a bunch of random HTML pages no one has asked for in five years is not, especially compared to downloading the files designed for it.

It adds up very quickly.

The worst thing is calendar/schedule pages. Many crawlers try to load every single day, in day view, week view, and month view. Those pages are dynamically generated and virtually limitless.
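
One common mitigation is to stop rendering calendar views far outside any range a human would browse, so the URL space stops being effectively infinite. A sketch, assuming a Python web app; the two-year window and function name are arbitrary choices for illustration:

    from datetime import date, timedelta

    WINDOW = timedelta(days=730)  # arbitrary: only render views within +/- 2 years

    def calendar_view_allowed(requested: date, today: date) -> bool:
        """Reject day/week/month views only reachable by walking next/prev links forever."""
        return abs(requested - today) <= WINDOW

    print(calendar_view_allowed(date(2031, 1, 1), today=date(2026, 1, 1)))  # False: > 2 years out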

The API seems to be written in Perl: https://github.com/metabrainz/musicbrainz-server

Time for a vinyl-style Perl revival ...

Futures markets give traders leverage of 100x, sometimes more. Margin requirements are much lower than for trading spot.
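
A worked example of what 100x leverage means (illustrative numbers, not any particular exchange's rules): the margin posted is 1% of the notional position, so roughly a 1% adverse move wipes it out and triggers liquidation.

    position = 100_000               # notional size of the futures position, in dollars
    leverage = 100
    margin = position / leverage     # $1,000 posted as collateral

    adverse_move = 0.01              # a 1% move against the position
    loss = position * adverse_move   # $1,000

    print(loss >= margin)            # True: margin wiped out, position gets liquidated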


Margin requirements for trading spot are zero, though initial capital requirements are obviously, well, whatever spot is.

Futures contracts aren't just pieces of paper traded between people, they are actual promises to pay for physical delivery of the underlying.

It's not surprising to me that crypto people consider them nothing more than leveraged gambling slips but that's really not how one should think about them. Personally I think crypto needs far heavier regulation than it gets.


Ever heard of liquidations?


Yes, it’s really ‘weird’ that they refuse to share any details. Completely unlike AWS, for example. As if being open about issues with their own product wouldn’t be in their best interest. /s


What is really at risk?


Maybe the instances are shared between users via sharding or are re-used and not properly cleaned.

And maybe they contain the memory of the users and/or the documents uploaded?


And what do you expect to get? Some arbitrary, uninteresting corporate paper, a homework assignment, someone's fanfiction.

Again, what is the risk?


Presumably you’re being sarcastic, to show that those AI companies don’t give a damn about our data. Right?


Couldn't this be a first step before further escalation?


And then what? What is the risk?


I guess a sandbox escape, something, profit?


Doesn't OpenAI have a ton of data on all of its users?


And what is at risk? Someone seeing someone else's fanfiction? Or another reworded business email? Or the vacancy report of some guy in southern Germany?


This is a wild take and I’m not sure where to begin. What if I leaked your medical data, or your emails, or your browser history? What’s at risk? Your data means nothing to me.


No, it is not. Outside this strange bubble on Hacker News, no one really cares or has ever heard of the creator.

They just use WordPress.


1st rule of hacking: don't write your freaking name on it!


Plot twist: Nobody who is in charge should care.

Leave the no to the naysayers.

Ship your app, generate traffic, usage, income. Leave the discussions to other people.


Do that at $BigCorp and Legal will eat you alive, if you're not fired outright.

Long ago I went through the company-approved process to link to SQLite and they had such a long list of caveats and concerns that we just gave up. It gave me a new understanding of how much legal risk a company takes when they use a third-party library, even if it's popular and the license is not copyleft.


Unless you end up in a lawsuit that asks for a hypothetical 50% of your income for using tech very similar to theirs, where they speculate it's been stolen or isn't permitted by their license. Even if you know you are going to win, or that it doesn't affect you, you still have to spend money on lawyers to fight it.


Commenting on this to mark it in my feed for later reference. Well said!


Best is to go into the woods and live with bees.


Yes, because the world is just binary like that. You can only choose one or the other... /s


How about we invent SAI and terraform super-earths in the Milky Way to atone for our sins here?


Just copy-paste your error message and do what ChatGPT tells you.


How well has that worked for you?


For me it's usually all but useless. It "may" be this, it "may" be that, with no clue about what information it would need for a more accurate diagnosis.

