The post raises several points that I wholeheartedly agree with, but the framing is poor and honestly kind of elitist (or just short-sighted). Maybe to the point that I think much of it might just be bait, lol. For example:
> Ask a twenty-two-year-old to connect to a remote server via SSH. Ask them to explain what DNS is at a conceptual level. Ask them to tell you the difference between their router’s public IP and the local IP of their laptop. Ask them to open a terminal and list the contents of a directory. These are not advanced topics. Twenty years ago these were things you learned in the first week of any serious engagement with computers.
What? Computers were everywhere in all kinds of domains by 2006, but you can bet that your average accountant of the time would most likely not be able to SSH into a server (nor should they need to...) I guess it really depends on what the author qualifies as a "serious engagement with computers."
They've basically got the dates wrong. It would make sense if they'd said 35 years ago; that's when this was common knowledge among people who used computers at all.
I'd say almost all of that became redundant for the average person with the Windows 3.1 release (34 years ago) or, maybe more so, Windows 95 (31 years ago).
I remember desperately trying to get two computers to talk to each other so we could play Doom in the early '90s; whatever black magic we had to do seemed to take hours to get working.
The time we had 3 or even 4 computers playing Baldur's Gate together, I swear we started trying to get the machines talking at 7pm and didn't start playing till 10 (but it was amazing).
It obviously depends on local laws, but it's very commonly illegal to sell prepared food without a license/permit. You might not get caught selling food on FB Marketplace, but that doesn't make it legal.
I agree with the author regarding Apple's walled-garden app distribution, but the analogy just doesn't work here.
I'm interested in how the poison data was generated and why it's "practically endless". It looks like bits of code, structured data, and prose, but with small modifications that make it subtly incorrect. Usually off-by-a-few numbers, e.g. I got the text of GPL-3.0 with a copyright date of 2738.
Who says you need to pipe the entire document with JSON-LD directly into the context window? I agree, that is very wasteful. You can just parse the relevant bits out and convert the JSON-LD data into something like your txt format before presenting it to the LLM. Bake that right into whatever tool it uses to scrape websites.
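To make the idea concrete, here's a minimal sketch of that preprocessing step, using only the Python standard library. The HTML snippet and the "path: value" flattening format are made up for illustration; a real scraping tool would plug this in before anything reaches the model's context window.

```python
# Sketch: extract <script type="application/ld+json"> blocks from HTML and
# flatten them into compact plain text, instead of feeding raw HTML to an LLM.
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collects parsed JSON-LD blocks found in script tags."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld and data.strip():
            self.blocks.append(json.loads(data))

def flatten(obj, prefix=""):
    """Turn nested JSON-LD into 'path: value' lines, dropping @-metadata keys."""
    lines = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k.startswith("@"):
                continue  # skip @context, @type, etc.
            lines += flatten(v, f"{prefix}{k}.")
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            lines += flatten(v, f"{prefix}{i}.")
    else:
        lines.append(f"{prefix.rstrip('.')}: {obj}")
    return lines

# Hypothetical page with embedded schema.org product data:
html = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Widget", "offers": {"@type": "Offer", "price": "9.99"}}
</script></head><body>lots of markup the LLM never needs to see</body></html>"""

p = JSONLDExtractor()
p.feed(html)
for block in p.blocks:
    print("\n".join(flatten(block)))
```

The point isn't this exact format; it's that the conversion from verbose markup to a token-efficient representation can happen entirely on the client side, in the scraper itself.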
That solves the Token Tax. It fails the Bandwidth Tax.
To get that JSON-LD, you still download 2MB of HTML. You execute JS. You parse the DOM.
You are buying a haystack to find a needle, then cleaning the needle. We propose serving just the needle.
Furthermore, JSON-LD is strictly for facts. It cannot express @SEMANTIC_LOGIC. It lacks the instructions on how to sell.
Anubis doesn't target crawlers which run JS (or those which use a headless browser, etc.) It's meant to block the low-effort crawlers that tend to make up large swaths of spam traffic. One can argue about the efficacy of this approach, but those higher-effort crawlers are out of scope for the project.
Wait, but then why bother with the PoW system at all? If they're just trying to block anyone without JS, that's way easier and doesn't require slowing things down for end users on old devices.
Reminds me of how Wikipedia makes all of its data available, even in a nice format specifically intended for scrapers (I think), and even THEN some scrapers still scraped the site directly, costing Wikipedia enough money that I'm pretty sure they had to address it publicly.
Even then, man, I feel like scrapers could save so many resources (both theirs and Wikipedia's) if they had the sense to not scrape Wikipedia directly and instead follow Wikipedia's rules.
I don't think that's the threat model here. The concern is regarding potentially sensitive information being sent to a third-party system without being able to audit which information is actually sent or what is done with it.
So, for example, if your local `.env` is inadvertently sent to Cursor and it's persisted on their end (which you can't verify one way or the other), an attacker targeting Cursor's infrastructure could potentially compromise it.
The forks aren't actually automatically taken down in most cases. The claimant must list every individual fork in the claim. Which I love, because it's kind of petty but still follows the DMCA to the letter.
Here is an example[1] of the form claimants must fill out.
> Each fork is a distinct repository and must be identified separately if you believe it is infringing and wish to have it taken down
IIRC it took them a couple months to get through all of the Yuzu forks after the initial DMCA and lawsuit. I doubt there were nearly as many forks of Ryujinx, though.