joepie91_'s comments | Hacker News

This is a defeatist argument. That it's technically possible to abuse things doesn't mean the responsibility needs to fall on the defending party, especially not when that is brought up in response to asking someone to reflect on possibilities for abuse - by that point it starts looking a lot more like a "well you'll just have to deal with it" argument that socially defends the abusers, and a lot less like genuine advice.


> This is a defeatist argument.

No side is getting defeated any time soon. I've been involved in skirmishes on both sides of scraping, and as I said, it's an arms race with no clear winner. To be clear, not all scraping is abuse.

The number of people who'll start scraping because a new tool exists is negligible (i.e. <0.001 of scraping). Scraping itself is not hard at all: a noob who can copy-paste code from the web, or vibe-code a client, can scrape 80-90% of the web. A motivated junior can raise that to maybe 98-99% of the Internet using nothing but libraries that existed before this tool.

> especially not when that is brought up in response to asking someone to reflect on possibilities for abuse

Sir/ma'am - this is Hacker News; granted, it's aspirational, but still, hiding information is not the way. As someone who's familiar with the arts, I can say there is nothing new or groundbreaking in this engine. Further, there is no inherent moral high ground for the "defenders" either: many anti-scraping methods rely on client fingerprinting and other privacy-destroying techniques. It's not the existence of the tool or technique that matters, but how one uses it.

>... "well you'll just have to deal with it" argument that socially defends the abusers

The abuse predates the tool, so wishing the tool away is unlikely to help. Scraping is a numbers game on both sides: the best one can hope for is to defeat the vast majority of average adversaries while a few fall through the cracks - the point is to outrun your fellow hiker, not the bear. However, should you encounter an adversary who has specifically chosen you as a target, victory is far from assured; the usual result is a drawn-out stalemate. Most well-behaved scrapers are left alone.


Have you actually tried blocking these scraper bots? The whole problem is that if you do, they start impersonating normal browsers from residential IPs instead. They actively evade countermeasures.


Isn't everything measures and countermeasures though?

As far as I am aware there is no such thing as a silver bullet anywhere when it comes to security.

It's like moving your SSH port from port 22 to some other random one. Will it stop advanced scripts from scanning your server and finding it? No, but it sure as hell will cut down the noise of unsophisticated connections, which means you can focus on the tougher ones.
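
For what it's worth, a minimal sketch of that change, assuming a stock OpenSSH server (the port number here is arbitrary):

    # /etc/ssh/sshd_config
    # Move the daemon off the default port; 2222 is just an example.
    Port 2222

    # Then restart the service, e.g.
    #   sudo systemctl restart sshd   (or "ssh", depending on the distro)
    # and remember to open the new port in your firewall before disconnecting.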


Finally, a use for CFAA?


> The crawlers could just as well be search engine startups.

And yet they are not. So what does that tell you?


None of the AIs have any 'knowledge' to begin with, so that's an easy one to satisfy.


There are a few ways in which bots can fail to get past such challenges, but the most durable one (i.e. the one you cannot work around by changing the scraper code) is that it simply makes each request much more expensive.

Like spam, this kind of mass-scraping only works because the cost of sending/requesting is virtually zero. At the scale these operations run, any cost at all is a massive increase over 'virtually zero', even if it would be negligible for a normal user.
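
To make that concrete, here is a minimal sketch of a hashcash-style proof-of-work challenge - not any particular product's scheme; the token format and difficulty value are made up for illustration. The server spends one hash to verify, while the client has to burn CPU proportional to the difficulty:

    import hashlib
    import secrets

    def solve(token: str, difficulty_bits: int) -> int:
        """Brute-force a nonce; expected work grows as roughly 2**difficulty_bits."""
        prefix = "0" * (difficulty_bits // 4)  # simplification: difficulty in multiples of 4 bits
        nonce = 0
        while True:
            digest = hashlib.sha256(f"{token}:{nonce}".encode()).hexdigest()
            if digest.startswith(prefix):
                return nonce
            nonce += 1

    def verify(token: str, nonce: int, difficulty_bits: int) -> bool:
        """One hash for the server, no matter how much work the client did."""
        prefix = "0" * (difficulty_bits // 4)
        return hashlib.sha256(f"{token}:{nonce}".encode()).hexdigest().startswith(prefix)

    token = secrets.token_hex(16)   # issued per request/session by the server
    nonce = solve(token, 16)        # trivial for one page view, costly across millions
    assert verify(token, nonce, 16)

A single visitor barely notices the delay; a crawler making millions of requests suddenly has a real compute bill.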


> I think the uncontrolled price of cloud traffic is a real fraud

Yes, it is.

> and a way bigger problem than some AI companies that ignore robots.txt.

No, it absolutely is not. I think you underestimate just how hard these AI companies hammer services - it is bringing down systems that have weathered significant past traffic spikes with no issues, and the traffic volumes are at the level where literally any other kind of company would've been banned by their upstream for "carrying out DDoS attacks" months ago.


> I think you underestimate just how hard these AI companies hammer services

Yeah, I really don't understand this, and I don't understand comparing it with DDoS attacks. There's no difference from what search engines are doing, and in some way it's worse? How? It's simply scraping data - what significant problems can it cause? Cache pollution? And that's it? I mean, even when we're talking about ignoring robots.txt (which search engines often do too) and calling costly endpoints - what's the problem with adding some captchas or rate limiters to those endpoints?
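
A rough sketch of the kind of per-client rate limiter being suggested here, as an in-process token bucket (the limits and names are invented for illustration; real setups usually put this in the reverse proxy or a shared store instead):

    import time
    from collections import defaultdict

    RATE = 5    # tokens refilled per second, per client
    BURST = 20  # maximum bucket size

    _buckets = defaultdict(lambda: {"tokens": float(BURST), "ts": time.monotonic()})

    def allow(client_ip: str) -> bool:
        """Return True if this client may hit the costly endpoint right now."""
        bucket = _buckets[client_ip]
        now = time.monotonic()
        bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["ts"]) * RATE)
        bucket["ts"] = now
        if bucket["tokens"] >= 1:
            bucket["tokens"] -= 1
            return True
        return False  # caller responds with HTTP 429 instead of doing the work

A token bucket like this allows short bursts while capping the sustained request rate per client.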


What likely happened here is that they were raising prices due to increased energy and other costs; if they hadn't made this change, they would have had to raise prices even more, so relative to that it keeps things cheaper for low-traffic customers - and they just communicated this poorly.


> Git, or any file-server based software, is not built to scale up well in today's world. Large Git hosters have to invest entire teams to manage their file servers and their Git front-end systems to create a web-scale service on top of a file-server based piece of software. I'm just skipping to the part where you don't need that anymore because Azure / GCP / AWS PaaS services already handle that.

This doesn't really make any sense. Most people are not "large Git hosters" (and so for them there is no functional difference between "outsourcing Git hosting" and "outsourcing to a Grace hoster that is outsourcing file handling"), and even those who are large Git hosters are still going to need a team of sysadm- sorry, "cloud experts" - to manage the AWS/Azure/whatever infrastructure.

What actual material benefit is being provided here? It seems to me like it just trades "administrating a standard hosting environment" in for "administrating a vendor-locked hosting environment".


> This doesn't really make any sense. Most people are not "large Git hosters"

I do work for GitHub, so I do know what it takes.

Most people don't run their own Git servers; they use GitHub / GitLab / Azure DevOps / etc., and I intend to create something that's easy for those hosters to adopt.

Grace is also designed to be easily deployable to your local or virtual server environment using Kubernetes - and if you're large enough to want your own version control server, you're already running Kubernetes somewhere - so, party on if you want to do that, but I expect the number of organizations running their own version control servers to be low and shrinking over time.

And Git isn't going anywhere. If that's what you want to run on your own server and client, I won't stop you.


The license allowing for something does not mean you are okay with anyone being part of your community.


Sure, but that's an orthogonal point to the one OP made, isn't it? Contributing to open source projects is incompatible with not wanting someone else to use your work based on ideological differences. Perhaps contributors don't think about this until they're faced with a situation that makes them uncomfortable - I sympathize with that - and maybe we should start adding disclosures that say "your work may be used by entities you do not want using your work".


That is about licenses, not about community participation, let alone community policy or sponsorships. Those are all very different things with very different considerations.

(Signed, someone who absolutely does not want military contractors in their community, but feels that a license is the wrong place to enforce that.)

