
Is having a real robot creepy? I don't know. Is having a robot operated by a human creepy and scary? Absolutely yes.

We've seen that people behave worse when you introduce indirection. People act worse on the internet. Soldiers have an easier time killing with drones than in person. The ethical issue cuts both ways: it's inhumane to the operator, but I also don't want to feel like a fake person on a video screen to them.

This is then exacerbated when you realize that the people operating this machine are almost certainly not being paid well, creating obvious and legitimate negative incentives. Then you plop them into the households of people with the insane wealth required to afford this. You might think that I have just described the situation with maids (and to some extent, I agree! I have never really felt comfortable with that dynamic either), but this is actually different, because you are adding in the indirection and making actions and interactions feel less "real" to both parties: the clients are likely to treat the robots worse than they would a human helper, and the operators may feel the rude clients they see on their monitors aren't as real as the people around them.


I think they intend this as a step toward getting enough training data to no longer need a human in the loop. I have actually been following 1X and what was Halodi (they merged at some point) for a while, and their intention is full autonomy.

Besides having a stranger in your house, you also have the company probably recording stuff. Privacy-wise... it's worse. But that makes me less concerned about safety, since any misbehavior would quickly be detected.


> since any misbehavior would quickly be detected

I can think of at least one very prominent company that is currently recording, at scale, its users in its quest for full autonomy. As best I can tell, that company simply deletes videos when they are inconvenient.


It’s even worse: with maids, given the socioeconomic dynamics, even if they are paid little, they will be paid “local-market rates,” where by definition they have to earn enough to (maybe barely) live near the people paying them.

Teleoperated robots don’t have that constraint and can pay “international low” levels of compensation.


But then the pay can go the other way too, so they can make more than a maid: in some countries, call center jobs for bilingual people pay double the local minimum wage.


Plenty of opportunity to use forced labourers in a DIFFERENT country while complying with all the immigration laws possible, and also saving the owners from having to meet real poor people. (I hope this will not work well...)


Right, but the low-income countries could also frame it as a new way to earn a living. I think withholding jobs from those countries gives them no help.


So it's servants, then.

Even needing a cleaner feels like living unsustainably - it's living in a house too big to maintain.

And if the answer is that work takes up too much time, then yes, we work too many hours.


My work is much more valuable than moving a broom around and washing some dishes. Anyone has the basic skills to clean a house; very few people have the skills to do advanced maths or physics, or engineering, or even some forms of mechanical labor.

The cleaning lady is not some rocket scientist; she is someone with very low skills who therefore does low-skilled labor: cleaning houses.


If her job is so worthless, why not go without and accept the (by definition) small loss of utility?

This is false: having a clean house, clean dishes, and cooked food is extremely valuable, but this is not captured by money, because half of the population were basically indentured servants who were culturally expected to provide this work for free.


It wasn't for free. It was their work. Which, on average and until very recently, was much, much safer and easier to do than the work of the other half of the population, who had to leave the house and seek work to get paid and allow that half to stay at home.


That "on average" is load bearing, does it include Queen Victoria? Just like ICE "allows" people to stay at Aligator Alcatraz. They should be paying rent!

Of course, the indentured servants paid rent in kind, with their bodies, whether they liked it or not.


> It wasn't for free

Say whaat? The woman's father literally paid good money to have her taken away. Everyone but the woman saw cash changing hands, but she was legally barred from owning property.

Freedom was for men.


Indulging for a moment that fantasy of yours about the purpose of dowries:

If the present “owner” paid “good money” to take her away, doesn’t that mean she was a liability instead of an asset?


Yep, we’re going to have robots molesting women and kids.


If this ever gets popular then sellers will “optimize” their product listings to exploit the LLM (a “soft” prompt injection if you will). This will definitely be the case in marketplaces (like Amazon and Walmart). It’ll turn the old boring task of shopping into a fun puzzle to spot the decoy item or overpriced product.


It could happen, but I am not building an Amazon shopping list. It’s about building a list from a physical store that will get delivered to me in a few hours. This is for shopping through a retailer, not the marketplace.

I do think it’s a concern, but I think it’s no different from the exact problem that exists today in marketplace operations like Amazon. I know for me, I will actually split my shopping up and often shop less with Amazon and more with Walmart because of it.


Perfect number to make H1Bs a tool that is out of reach for startups but still meaningful for large entrenched corporations. Nailed it. Maybe they can even waive the fee if you give the US government 10% of your company.


University hiring is basically rekt. Throwing out the baby with the bathwater, per usual with this admin...


How much does university hiring depend on H-1B? I would expect much of that comes through O-1 or EB-1/2/3, no?


H-1B is the default visa for international faculty hires. You can get it in a few months with relatively little effort. O-1 is more expensive, takes longer to get, and requires more effort from the applicant. Then there is the subjective approval process that involves a degree of risk, and in the end, you get a slightly inferior visa.

Green cards are almost useless for hiring, as the processing times are too long. "We would like to offer you this position, but conditionally. We still need a year or two to handle the bureaucracy, and we can't say for sure if we are actually allowed to hire you. Please don't accept another position meanwhile."


No, pretty much all professors who used to be international students or postdocs are on H1B.


My understanding is postdocs are virtually all on J-1 visas, which is a meaningful part of uni hiring.


> used to be


So... Now those spots will have to go to American students and grads?


Some will.

Most won't be filled at all.


+1. This will also reduce demand for these programs from international students - making tuition more expensive for locals. Asking a median HN poster to consider 2nd/3rd order effects seems like a bit too much, though.


lots of immigrant kids are in uni now. all my cousins are doing cs now. look at latest batch of yc founders.


An equity minimum would deal with this.


I'm glad the em dash is getting properly shit on these days, if for unrelated reasons. I've never liked it. I hate the stupid spacing rules around it. It never looks right to put no spaces around the em dash, and it probably breaks all sorts of word-splitting code that's based on "\s". Where else does punctuation without surrounding spaces not mean a single word? A hyphen without spaces makes a compound word: it counts as one. Imagine if the correct use of a colon was to not put spaces around it:like this. Do you like that? Of course not.
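To make the "\s" point concrete, here's a minimal Python sketch (the example strings are mine, purely for illustration): a naive tokenizer that splits on whitespace treats an unspaced em dash as glue between two words, while a hyphenated compound genuinely is one word.

```python
import re

# Naive whitespace tokenizer, as in word-splitting code based on "\s"
def tokenize(text):
    return re.split(r"\s+", text)

print(tokenize("a well-known fact"))          # ['a', 'well-known', 'fact'] -- compound stays one token, as intended
print(tokenize("two words\u2014fused here"))  # ['two', 'words—fused', 'here'] -- two distinct words stuck together
```

The em dash (U+2014) doesn't match `\s`, so "words—fused" survives as one token even though it's two words.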

But I think worst of all it just gives me the fucking creeps, some uncanny-valley bullshit. I see hyphens a million times a day, then out of nowhere comes this creepy slender-man looking motherfucker that's just a little bit longer than you'd expect or like, and is always touching all the letters around it when it shouldn't need to. It stands out looking like a weird print error... on my screen! Hopefully it keeps building a worse and worse reputation.


Does no one else find it weird seeing anything from this administration "anti-Bitcoin" at all? I wouldn't be surprised by this headline during a previous administration, but generally speaking, this administration has been very Bitcoin-friendly (and Bitcoin institutions friendly right back). To be clear, the simplest answer is "sure but that doesn't mean they have to agree on everything". But I would like to propose that if you ask the simple question of "who does this benefit?" it may suggest we are witnessing a different phenomenon here.

I think this might be the first indication that what we currently call "institutional Bitcoin supporters" are not "Bitcoin supporters" at all, or rather, what they call "Bitcoin" is not what you and I call "Bitcoin". Services like Coinbase and BTC ETFs don't really suffer from this development at all. In fact, I think they quite obviously benefit from something like this (at least from the first-order effects). What's the alternative to self custody? Well... third-party custody. Especially since they are already bound up by KYC rules, right? There is a cynical reading in which there's nothing inconsistent with this development, if you consider "institutional Bitcoin's" goals to primarily be replacing existing financial power structures with themselves. "Bitcoin" is just a means to an end. Their goals were only incidentally aligned with individual BTC holders since they were previously in similar circumstances as the "out group". Previous administrations were as suspicious of "Bitcoin companies" as any individual Bitcoin holder, perhaps even more so. But that's not the case anymore. Bitcoin companies have successfully been brought into the fold, so it's not even that they're necessarily "betraying" the values of Bitcoin true believers; you might argue that interpretation of shared values was entirely inferred to begin with.

Critically though, I think an important consequence of this is that Bitcoin purists and skeptics should realize that they arguably now have more in common than not, at least in the immediate term, and may be each other's best allies. In my experience, for most of the existence of Bitcoin, its skeptics haven't really seen Bitcoin as a "threat." Instead, to admittedly generalize, their critiques have been mostly about Bitcoin being "broken" or "silly" or "misunderstanding the point of centralized systems", etc. These aren't really "oppositional" positions in the traditional "adversarial sense," more dismissive. In fact, the closest thing to an "active moral opposition" to Bitcoin that I've seen is an environmental one. IOW, Bitcoin true believers think about Bitcoin way more than Bitcoin skeptics do. Similarly, Bitcoin true believers really have nothing against skeptics other than... the fact that they occasionally talk shit about Bitcoin? IOW, Bitcoin skeptics are not "the natural enemy Bitcoin was designed to defeat".

But if you think about it, "institutional Bitcoin" sort of embodies something both these camps generally have hated since before Bitcoin. Whether you believe Bitcoin to be a viable answer or not, it is undeniable that the "idea" of Bitcoin is rooted in distrust of elitist financial institutions that evade accountability, benefit from special treatment, and largely get to rig the larger system in their favor. Similarly, I don't think Bitcoin skeptics like these institutions or are "on their side". In fact, perhaps they'd argue that they predicted that Bitcoin wouldn't solve any of this and would just be another means of creating them. But IMO what they should both realize is that the most important threat right now is these institutional players. They are, in fact, only "nominally" Bitcoin in a deep sense. From the perspective of true believers, their interests are actually in no way "essentially" aligned with any "original Bitcoin values," and from the perspective of skeptics, the threat they pose has very little to do with their use of "the Bitcoin blockchain".

They are arguably just another instantiation of the "late stage capitalist" playbook of displacing an existing government service in order to privatize its rewards. Coinbase could be argued to have more in common with Uber than with Ledger wallets. Instead of consolidating and squeezing all the value from taxis, though, the play is to do the same with currency itself. It is incidental that Uber happened to be so seemingly "government averse". In this context, it's actually helpful to cozy up to the government and provide the things government departments want that make no difference to fintech's bottom line (such as KYC). In fact, that might be their true value proposition. Bitcoin only enters the conversation because in order to replace a currency, you do... need a currency. Bitcoin was convenient. It was already there, it had a built-in (fervent) user base that was happy to do your proselytizing for you, and even saw you as a good "first step" for normies that couldn't figure out how to manage their own wallet. The Bitcoin bubble was already there; why fight it when you can ride it?

Again, I think this is highly likely to be against the values of Bitcoin true believers and skeptics alike, and I also think that if the above is true, it represents an actual danger to us all. Recent events with credit card processors have already demonstrated that payment systems have proven to be incredibly efficient tools at stifling speech. In other words, this is arguably an "S-tier threat", on par with or perhaps worse than any sort of internet censorship or net neutrality. If so, we should treat it as such and work together.


Generally speaking, the second you realize a technology/process/anything has a hard requirement that individuals independently exercise responsibility or self-control, with no obvious immediate gain for themselves, it is almost certain that said technology/process/anything is unsalvageable in its current form.

This is in the general case. But with LLMs, the entire selling point is specifically offloading "reasoning" to them. That is quite literally what they are selling you. So with LLMs, you can swap out "almost certain" in the above rule to "absolutely certain without a shadow of a doubt". This isn't even a hypothetical as we have experimental evidence that LLMs cause people to think/reason less. So you are at best already starting at a deficit.

But more importantly, this makes the entire premise of using LLMs make no sense (at least from a marketing perspective). What good is a thinking machine if I need to verify it? Especially when you are telling me that it will be a "super reasoning" machine soon. Do I need a human "super verifier" to match? In fact, that's not even a tomorrow problem, that is a today problem: LLMs are quite literally advertised to me as a "PhD in my pocket". I don't have a PhD. Most people would find the idea of me "verifying the work of human PhDs" to be quite silly, so how does it make any sense that I am in any way qualified to verify my robo-PhD? I pay for it precisely because it knows more than I do! Do I now need to hire a human PhD to verify my robo-PhD? Short of that, is it the case that only human PhDs are qualified to use robo-PhDs? In other words, should LLMs exclusively be used for things the operator already knows how to do? That seems weird. It's like a Magic 8 Ball that only answers questions you already know the answer to. Hilariously, you could even find someone reaching the conclusion of "well, sure, a curl expert should verify the patch I am submitting to curl. That's what submitting the patch accomplishes! The experts who work on curl will verify it! Who better to do it than them?". And now we've come full circle!

To be clear, each of these questions has plenty of counter-points/workarounds/etc. The point is not to present some philosophical gotcha argument against LLM use. The point rather is to demonstrate the fundamental mismatch between the value-proposition of LLMs and their theoretical "correct use", and thus demonstrate why it is astronomically unlikely for them to ever be used correctly.


I use coding LLMs as a mix of:

1. a better autocomplete -- here the model can make mistakes, but on balance I've found this useful, especially when constructing tests, writing output in a structured format, etc.;

2. a better search/query tool -- I've found answers by being able to describe what I'm trying to do, whereas with a traditional search I have to know the right keywords to try. I can then go to the documentation or search if I need additional help/information;

3. an assistant to bounce ideas off -- this can be useful when you are not familiar with the APIs or configuration. It still requires testing the code, seeing what works, seeing what doesn't work. Here, I treat it in the same way as reading a blog post on a topic, etc. -- the post may be outdated, may contain issues, or may not be quite what I want. However, it can have enough information for me to get the answer I need -- e.g. a particular method, for which I can then consult docs (such as documentation comments on the APIs), etc. Or it lets me know what to search on Google, etc.

In other words, I use LLMs as part of the process like with going to a search engine, stackoverflow, etc.


> a better autocomplete

This is 100% what I use Github Copilot for.

I type a function name and the AI already knows what I'm going to pass it. Sometimes I just type "somevar =" and it instantly correctly guesses the function, too, and even what I'm going to do with the data afterwards.

I've had instances where I just type a comment with a sentence of what the code is about to do, and it'll put up 10 lines of code to do it, almost exactly matching what I was going to type.

The vibe coders give AI-code generation a bad name. Is it perfect? Of course not. It gets it wrong at least half the time. But I'm skilled enough to know when it's wrong in nearly an instant.


GPT-5 Pro catches more bugs in my code than I do now. It is very very good.

LLMs are pretty consistent about what types of tasks they are good at, and which they are bad at. That means people can learn when to use them, and when to avoid them. You really don't have to be so black-and-white about it. And if you are checking the LLM's results, you have nothing to worry about.

Needing to verify the results does not negate the time savings either when verification is much quicker than doing a task from scratch.

My code is definitely of higher quality now that I have GPT-5 Pro review all my changes, and then I review my code myself as well. It seems obvious to me that if you care, LLMs can help you produce better code. As always, it is only people who are lazy who suffer. If you care about producing great code, then LLMs are a brilliant tool to help you with just that, in less time, by helping with research, planning, and review.


This doesn't really address the point that is currently being argued I think, so much so that I think your comment is not even in contention with mine (perhaps you didn't intend it to be!). But for lack of a better term, you are describing a "closed experience". You are (to some approximation) assuming the burden of your choices here. You are applying the tool to your work, and thus are arguably "qualified" to both assess the applicability of the tool to the work, and to verify the results. Basically, the verification "scales" with your usage. Great.

The problem that OP is presenting is that, unlike in your own use, the verification burden from this "open source" usage is not taken on by the "contributors", but instead "externalized" to maintainers. This does not result in the same "linear" experience you have; their experience is asymmetric, as they are now being flooded with a bunch of PRs that (at least currently) are harder to review than human submissions. Not to mention that, also unlike your situation, they have no means to "choose" not to use LLMs if they for whatever reason discover it isn't a good fit for their project. If you see something isn't a good fit, boom, you can just say "OK, I guess LLMs aren't ready for this yet." That's not a power maintainers have. The PRs will keep coming as a function of the ease of creating them, not as a function of their utility. Thus the verification burden does not scale with the maintainer's usage. It scales with the sum of everyone who has decided they can ask an LLM to go "help" you. That number is both larger and out of their control.

The main point of my comment was to say that this situation is not only to be expected, but IMO essential and inseparable from this kind of use, for reasons that actually follow directly from your post. When you are working on your own project, it is totally reasonable to treat the LLM operator as qualified to verify the LLMs outputs. But the opposite is true when you are applying it to someone else's project.

> Needing to verify the results does not negate the time savings either when verification is much quicker than doing a task from scratch.

This is of course only true because of your existing familiarity with the project you are working on. This is not a universal property of contributions. It is not "trivial" for me to verify a generated patch in a project I don't understand, for reasons ranging from things as simple as the fact that I have no idea what the code contribution guidelines are (who am I to know if I am even following the style guidelines) to things as complicated as the fact that I may not even be familiar with the programming language the project is written in.

> And if you are checking the LLM's results, you have nothing to worry about.

Precisely. This is the crux of the issue -- I am saying that in the contribution case, it's not even about whether you are checking the results, it's that you arguably can't meaningfully check the results (unless you of course essentially put in nearly the same amount of work as just writing it from scratch).

It is tempting to say "But isn't this orthogonal to LLMs? Isn't this also the case with submitting PRs you created yourself?" No! It is qualitatively different. Anyone who has ever submitted a meaningful patch to a project they've never worked on before has had the experience of having to familiarize themselves with the relevant code in order to create said patch. The mere act of writing the fix organically "bootstraps" you into developing expertise in the code. You will if nothing else develop an opinion on the fix you chose to implement, and thus be capable of discussing it after you've submitted it. You, the PR submitter, will be worthwhile to engage with and thus invest time in. I am aware that we can trivially construct hypothetical systems where AI agents are participating in PR discussions and develop something akin to a long term "memory" or "opinion" -- but we can talk about that experience if and when it ever comes into being, because that is not the current lived experience of maintainers. It's just a deluge of low quality one-way spam. Even the corporations that are specifically trying to implement this experience just for their own internal processes are not particularly... what's a nice way to put this, "satisfying" to work with, and that is for a much more constrained environment, vs. "suggesting valuable fixes to any and all projects".


I'm not advocating that the verification should be on the maintainer. It should definitely be on the contributor/submitter to verify that what they are submitting is correct to the best of their abilities.

This applies if the reporter found the bug themselves, used a static analysis tool like Coverity, used a fuzzing tool, used valgrind or similar, used an LLM, or some other mechanism to identify the issue.

In each case the reporter needs to at a minimum check if what they found is actually an issue and ideally provide a reproducible test case ("this file causes the application to crash", etc.), logs if relevant, etc.


I was arguing against your dismissal of the value proposition of LLMs. I wasn't arguing about the case of open-source maintainers getting spammed by low-quality issues and PRs (where I think we agree on a lot of points).

The way that you argued that the value proposition of LLMs makes no sense takes a really black-and-white view of modern AI. There are actually a lot of tasks where verification is easier than doing the task yourself, even in areas where you are not an expert. You just have to actually do the verification (which is the primary problem with open-source maintainers getting spammed by people who do not verify anything).

For example, I have recently been writing a proxy for work, but I'm not that familiar with networking setups. But using LLMs, I've been able to get to a robust solution that will cover our use-cases. I didn't need to be an expert in networking. My experience in other areas of computer science combined with LLMs to help me research let me figure out how to get our proxy to work. Maybe there is some nuance I am missing, but I can verify that the proxy correctly gets the traffic and I can figure out where it needs to go, and that's enough to make progress.

There is some academic purity lost in this process of using LLMs to extend the boundary of what you can accomplish. This has some pretty big negatives, such as allowing people with little experience to create incredibly insecure software. But I think there are a lot more cases where if you verify the results you get, and you don't try to extend too far past your knowledge, it gives people great leverage to do more. This is to say, you don't have to be an expert to use an LLM for a task. But it does help a lot to have some knowledge about related topics at least, to ground you. Therefore, I would say LLMs can greatly expand the scope of what you can do, and that is of great value (even if they don't help you do literally everything with a high likelihood of success).

Additionally, coding agents like Claude Code are incredible at helping you get up-to-speed with how an existing codebase works. It is actually one of the most amazing use-cases for LLMs. It can read a huge amount of code and break it down for you so you can start figuring out where to start. This would be of huge help when trying to contribute to someone else's repository. LLMs can also help you with finding where to make a change, writing the patch, setting up a test environment to verify the patch, looking for project guidelines/styleguides to follow, helping you to review your patch against those guidelines, and helping you to write the git commit and PR description. There's so many areas where they can help in open-source contributions.

The main problem in my eyes is people that come to a project and make a PR because they want the "cred" of contributing with the least possible effort, instead of because they have an actual bug/feature they want to fix/add to the project. The former is noise, but the latter always has at least one person who benefits (i.e., you).


In my experience most of the work a programmer does just isn't very difficult. LLMs are perfectly fine for that.


There’s some corollary here to self-driving cars which need constant baby-sitting.


How strange that the article never links directly to the Helix editor. I usually immediately open the homepage of whatever a blog post is talking about as a background tab, to be able to click back and forth, or to immediately figure out what the thing being talked about is. But no luck here, except for some decoys: the "helix" link next to the title is just the tag "helix", which sends you to a page with all the posts tagged "helix" (which happens to contain just this one post).

I of course quickly just googled it myself and found the page, and so afterward I went to the source of the blog post and searched for the URL to confirm that it wasn't actually linked to anywhere. Turns out that about three quarters of the way down, in the "Key Bindings" section, there is a link to the Helix keymappings documentation page, which appears to be the closest thing to a direct homepage link.

Anyways, no nefarious intent being implied of course, I just found it sort of interesting. I am pretty certain it just got accidentally left out, or maybe the project didn't have a homepage back in December of 2024 when this was originally written? Although the github page isn't directly linked either (only one specific issue in the github tracker).

Oh, and here's a link to their page: https://helix-editor.com/

And github page: https://github.com/helix-editor/


Yes, it was pure accident! I surely had the Helix homepage and documentation open most of the time while writing this, but only thought to link that one bit of documentation! When I get to a computer next I'll update it with a link, because that would be useful.


Not linking to stuff is the new normal. Many subreddits ban you if you post a link to a source. Tweets no longer contain links - you need to click on the tweet to see the follow-up tweets that maybe contain the link.


> Not linking to stuff is the new normal.

Maybe in certain anti-intellectual crowds. But not here.

Reddit behavior shouldn't restrict you elsewhere.


What? Can you give some examples? I don't use reddit anymore but this sounds unbelievable to me. They ban you for providing a source?


I didn't see mention anywhere of a license. I also don't see anywhere to download this from. Is this release equivalent to saying "here is an OFL metric-compatible Arial," or are they releasing it in the sense of "our products will now look like they use Arial, but aside from that this doesn't concern you."?


It's 'available' for download here: https://www.are.na/_next/static/media/9844201f12bf51c2-s.p.w...

(but definitely don't think the license permits free use)


> I didn't see mention anywhere of a license.

This page, which is poorly designed¹ to the point that it supports the idea that this is all an in-joke rather than the work of pros, appears to suggest that this is a purely commercial work: https://abcdinamo.com/licenses

¹ Seen while scanning: (1) Scroll down, then up. Boo. (2) Leading cramped beyond "style preference". (3) Bulleted list badly styled in a way that requires work. (4) No attention paid to tracking where it's needed (e.g. small all-caps type). (5) Some terms (e.g. "First Designer") capitalized inconsistently. (6) '&' used in body copy.


Wikipedia says their traffic increased roughly 50% [1] from AI bots, which is a lot, sure, but nowhere near the amount where you'd have to rearchitect your site or something. And this checks out: if it were actually debilitating, you'd notice Wikipedia's performance degrade. It hasn't. You'd see them taking additional steps to combat this. They haven't. Their CDN handles it just fine. They don't even bother telling AI bots to just download the tarballs they specifically make available for this exact use case.

More importantly, Wikipedia almost certainly represents the ceiling of traffic increase. But luckily, we don't have to work with such coarse estimation, because according to Cloudflare, the total increase from combined search and AI bots in the last year (May 2024 - May 2025), has just been... 18% [2].

The way you hear people talk about it though, you'd think that servers are now receiving DDOS-levels of traffic or something. For the life of me I have not been able to find a single verifiable case of this. Which if you think about it makes sense... It's hard to generate that sort of traffic; that's one of the reasons people pay for botnets. You don't bring a site to its knees merely by accidentally "not making your scraper efficient". So the only other possible explanation would be such a large number of scrapers simultaneously but independently hitting sites. But this also doesn't check out. There aren't thousands of different AI scrapers out there that in aggregate are resulting in huge traffic spikes [2]. Again, the total combined increase is 18%.

The more you look into this accepted idea that we are in some sort of AI scraping traffic apocalypse, the less anything makes sense. You then look at this Anubis "AI scraping mitigator" and... I dunno. The author contends that one of its tricks is that it not only uses JavaScript, but "modern JavaScript like ES6 modules," and that this is one of the ways it detects/prevents AI scrapers [3]. No one is rolling their own JS engine for a scraper such that they are blocked by an inability to keep up with the latest ECMAScript spec. You are just using an existing JS engine, all of which support all these features. It would actually be a challenge to find an old JS engine these days.

The entire thing seems to be built on the misconception that the "common" way to build a scraper is doing something curl-esque. This idea is entirely based on the Google scraper, which itself doesn't even work that way anymore, and only ever did because it was written in the 90s. Everyone that rolls their own scraper these days just uses Puppeteer. It is completely unrealistic to make a scraper that doesn't run JavaScript and wait for the page to "settle down", because so many pages, even blogs, are just entirely client-side rendered SPAs. If I were to write a quick and dirty scraper today I would trivially make it through Anubis' protections... by doing literally nothing and without even realizing Anubis exists. Just using standard scraping practices with Puppeteer. Meanwhile Anubis is absolutely blocking plenty of real humans, with the author for example telling people to turn on cookies so that Anubis can do its job [4]. I don't think Anubis is blocking anything other than humans and Messages' link preview generator.

I'm investigating further, but I think this entire thing may have started due to some confusion, but want to see if I can actually confirm this before speculating further.

1. https://www.techspot.com/news/107407-wikipedia-servers-strug... (notice the clickbait title vs. the actual contents)

2. https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-cr...

3. https://codeberg.org/forgejo/discussions/issues/319#issuecom...

4. https://github.com/TecharoHQ/anubis/issues/964#issuecomment-...


> It is completely unrealistic to make a scraper that doesn't run JavaScript and wait for the page to "settle down" because so many pages, even blogs, are just entirely client-side rendered SPAs.

I specifically want a search engine that does not run JavaScript, so that it only finds documents that do not require JavaScripts to display the text being searched. (This is not the same as excluding everything that has JavaScripts; some web pages use JavaScripts but can still display the text even without it.)

> Meanwhile Anubis is absolutely blocking plenty of real humans, with the author for example telling people to turn on cookies so that Anubis can do its job [4]. I don't think Anubis is blocking anything other than humans and Messages' link preview generator.

These are some of the legitimate problems with Anubis (and this is not the only way you can be blocked by Anubis). Cloudflare can have similar problems, although it works a bit differently, so it is not exactly the same.


> I specifically want a search engine that does not run JavaScript, so that it only finds documents that do not require JavaScripts to display the text being searched. (This is not the same as excluding everything that has JavaScripts; some web pages use JavaScripts but can still display the text even without it.)

Sure... but off-topic, right? AI companies are desperate for high quality data, and unlike search scrapers, are actually not supremely time sensitive. That is to say, they don't benefit from picking up on changes seconds after they are published. They essentially take a "snapshot" and then do a training run. There is no "real-time updating" of an AI model. So they have all the time in the world to wait for a page to reach an ideal state, as well as all the incentive in the world to wait for that too. Since the data effectively gets "baked into the model" and then is static for the entire lifetime of the model, you over-index on getting the data, not on getting it fast, or cheap, or whatever.


Hi, main author of Anubis here. How am I meant to store state like "user passed a check" without cookies? Please advise.


If the rest of my post is accurate, that's not the actual concern, right? Since I'm not sure if the check itself is meaningful. From what is described in the documentation [1], I think the practical effect of this system is to block users running old mobile browsers, or browsers like Opera Mini in third world countries where data usage is still prohibitively expensive. Again, the off-the-shelf scraping tools [2] will be unaffected by any of this, since they're all built on top of Puppeteer, and additionally are designed to deal with the modern SPA web, which is (depressingly) more or less isomorphic to a "proof-of-work".

If you are open to jumping on a call in the next week or two I'd love to discuss directly. Without going into a ton of detail, I originally started looking into this because the group I'm working with is exploring potentially funding a free CDN service for open source projects. Then this AI scraper stuff started popping up, and all of a sudden it looked like if these reports were true it might make such a project no longer economically realistic. So we started trying to collect data and concretely nail down what we'd be dealing with and what this "post-AI" traffic looks like.

As such, I think we're 100% aligned on our goals. I'm just trying to understand what's going on here since none of the second-order effects you'd expect from this sort of phenomenon seem to be present, and none of the places where we actually have direct data seem to show this taking place (and again, Cloudflare's data seems to also agree with this). But unless you already own a CDN, it's very hard to get a good sense of what's going on globally. So I am totally willing to believe this is happening, and am very incentivized to help if so.

EDIT: My email is my HN username at gmail.com if you want to schedule something.

1. https://anubis.techaro.lol/docs/design/how-anubis-works

2. https://apify.com/apify/puppeteer-scraper


Cloudflare Turnstile doesn't require cookies. It stores per-request "user passed a check" state using a query parameter. So disabling cookies will just cause you to get a challenge on every request, which is annoying but ultimately fair IMO.


Doesn't Wikipedia offer full tarballs?

This would imaginably put some downward pressure on scraper volume.


From the first paragraph in my comment:

> You'd see them taking some additional steps to combat this. They haven't. Their CDN handles it just fine. They don't even bother telling AI bots to just download the tarballs they specifically make available for this exact use case.

Yes, they do. But they aren't in a rush to tell AI companies this, because again, this is not actually a super meaningful amount of traffic increase for them.


I don't think you understand the purpose of Anubis. If you did then you'd realize that running a web browser with JS enabled doesn't bypass anything.


By bypass I mean "successfully pass the challenge". Yes, I also have to sit through the Anubis interstitial pages, so I promise I know it's not being "bypassed". (I'll update the post to remove future confusion).

Do you disagree that a trivial usage of an off-the-shelf Puppeteer scraper [1] has no problem doing the proof-of-work? As I mentioned in this comment [2], AI scrapers are not on some time crunch; they are happy to wait a second or two for the final content to load (there are plenty of normal pages that take longer than the Anubis proof of work does to complete), and also are unfazed by redirects. Again, these are issues you deal with in normal everyday scraping. And also, do you disagree with the traffic statistics from Cloudflare's site? If we're seeing anything close to that 18% increase then it would not seem to merit user-visible levels of mitigation. Even if it were 180% you wouldn't need to do this. nginx is not constantly on the verge of failing from a double-digit "traffic spike".

As I mentioned in my response to the Anubis author here [3], I don't want this to be misinterpreted as a "defense of AI scrapers" or something. Our goals are aligned. The response there goes into detail that my motivation is that a project I am working on will potentially not be possible if I am wrong and this AI scraper phenomenon is as described. I have every incentive in the world to just want to get to the bottom of this. Perhaps you're right, and I still don't understand the purpose of Anubis. I want to! Because currently neither the numbers nor the mitigations seem to line up.

BTW, my same request extends to you, if you have direct experience with this issue, I'd love to jump on a call to wrap my head around this.

My email is my HN username at gmail.com if you want to reach out, I'd greatly appreciate it!

1. https://apify.com/apify/puppeteer-scraper

2. https://news.ycombinator.com/item?id=44944761

3. https://news.ycombinator.com/item?id=44944886


The fact that IP protection is expensive is essentially its defining feature. One way to think of "intellectual property" is precisely as a weird proof-of-work, since you are trying to simulate the features of physical property for abstract entities that by default behave in the exact opposite fashion.

This is the frustrating thing about getting into an argument about how "IP isn't real property" and then having the other side roll their eyes at you like you are some naive ideologue. They're missing the point of what it means for IP to not be "real property". The actual point is understanding that you are, and will be, swimming against the current of the fundamentals of these technologies forever. It is very very difficult to make a digital book or movie that can't be copied. So difficult, in fact, that we've had to keep pushing the problem lower and lower into the system, with DRM protections at the hardware level. This is inherently expensive, not just from a capital perspective, but from a "focus and complexity" burden perspective as well. Then realize that even after putting this entire system in place, an entire trade bloc could arbitrarily decide to stop enforcing copyright, AKA, stop fueling the expensive apparatus that is holding up the "physical property" facade for "intellectual property". This was actually being floated as a retaliation tactic during the peak of the tariff dispute with Canada [1]. And in fact we don't even need to go that far; it has of course always been the case that patents vary in practical enforceability country to country, and copyrights (despite an attempt to unify the rules globally) are also different country to country (the earliest Tintin is public domain in the US but not in the EU).

Usually at this point someone says "It's expensive to defend physical property too! See what happens if another country takes your cruise liner". But that's precisely the point: the difficulty scales with the item. I don't regularly have my chairs sitting in Russia for them to be nationalized. The entities that have large physical footprints are also the ones most likely to have the resources to defend that property. This is simply not the case with "intellectual property," which has zero natural friction in spreading across the world, and certainly doesn't correlate with the "owner's" ability to "defend" it. This is due to the fundamental contradiction that "intellectual property" tries to establish: it wants all the zero unit-cost and distribution benefits of "ethereal goods," with all the asset-like benefits of physical goods. It wants it both ways.

Notice that all the details always get brushed away: we assume we have great patent clerks making sure only "novel inventions" get awarded patents. It assumes that patent clerks are even capable of understanding the patent in question (they're not; the vast majority are new grads [2]). We assume the copyright office is properly staffed (it isn't [3]). We assume the intricacies of abstract items like "APIs" can be properly understood by both judge and jury in order to reach the right verdict in the theoretically obvious cases (it also turns out that most people are not familiar with these concepts).

How could this not be expensive? You essentially need to create "property lore" in every case that is tried. Any wish for the system to be faster would necessarily also mean less correct verdicts. There's no magic "intellectual property dude" that could resolve all this stuff. Copyright law says that math can't be copyrighted, yet we can copyright code. Patent law says life can't be patented, yet our system plainly allows patenting bacteria. Why? Because a lawyer held up a tube of clear liquid and said "does this seem like life to you?" The landmark Supreme Court case was decided 5-4 [4], and all of a sudden a thing that anyone who understands the science would say should obviously not be patentable, was. There are no "hidden true rules" that, if just followed, would make this system efficient. It is, by design, a system that makes things up as it goes along.

As mentioned in other comments, at best you could just flip burden to the other party, which doesn't make the system less expensive, it just shifts the default party that has to initially burden the cost. Arguably this is basically what we have with patents. Patents are incredibly "inventor friendly". You can get your perpetual motion machine patented easy-peasy. In fact, there is so much "respect" for "ideas" as "real things", that you can patent things you never made and have no intention of making. You can then sue companies that actually make the thing you "described first". Every case is a new baby being presented to King Solomon to cut in half.

In other words, an inexpensive system would at minimum require universal understanding and agreement on supremely intricate technical details of every field it aims to serve, which isn't just implausible, it is arguably impossible by definition since the whole point of intellectual property is to cover the newest developments in the field.

1. https://www.cigionline.org/articles/canada-can-fight-us-tari...

2. https://tolmasky.com/2012/08/29/patents-and-juries/

3. https://www.wired.com/story/us-copyright-office-chaos-doge/

4. https://supreme.justia.com/cases/federal/us/447/303/

