Pretty easy explanation: you've bumped into a few of the ~dozen people who are crawling hidden services for research or law enforcement purposes.
When you publish your descriptor to the HSDirs they'll come and crawl, and chances are none of them were expecting a 50PB archive.org mirror, so they just got stuck.
It's likely that once the operators of each crawler realized this HS was an archive.org mirror they stopped the crawls.
The early version of a crawler I ran across hidden services would have tripped up in exactly this way[0]
Everything else in this post is either a misunderstanding of Tor[1] or plain paranoia.
[1] The top exit nodes have little to do with who is crawling or attacking a hidden service. France and Germany feature heavily in nodes because of the many cheap Tor-friendly hosts there. There is nothing 'unusual' about unnamed nodes, and the AS confusion is just someone doing a good job of staying anonymous - thanks for reporting them
I'd say not holding to the standards of robots.txt and 403-Forbidden is quite malicious, just not evil or bad. If you build a crawler, you should play nice. But bot A-D were easily discouraged.
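For anyone building a crawler, here is a minimal sketch of what "playing nice" could look like, using Python's standard `urllib.robotparser`. The helper names `may_fetch` and `should_back_off`, the user agent string and the example URLs are made up for illustration; this is not what any of the bots in the post actually run.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as an iterable of lines,
    # so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def should_back_off(status_code: int) -> bool:
    """Treat 403 Forbidden as a signal to stop requesting from this host."""
    return status_code == 403


# Hypothetical robots.txt content and URLs, for illustration only.
ROBOTS = "User-agent: *\nDisallow: /private/\n"
print(may_fetch(ROBOTS, "examplebot", "http://example.onion/private/page"))
print(may_fetch(ROBOTS, "examplebot", "http://example.onion/public/page"))
print(should_back_off(403))
```

A well-behaved crawler would check `may_fetch` before queueing a URL and drop the host from its queue once `should_back_off` fires, instead of retrying.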
Eddie however is another problem. It overloads the network, doesn't crawl and doesn't parse the responses. This is not crawler behaviour...
The rest of the post is solid inductive reasoning (from my perspective): the bot is identifiable by its behaviour. It has a faster response time than a source-relay-source roundtrip would allow. Thus the bot must originate at the relay.
This is supported by the fact that the anonymous relays were set up just before the attack, all at the same time. And after the attack stopped, the majority of all traffic through the relays stopped.
There are also ways to keep your registration private without resorting to fraud. Though probably a number of people think of this as the 'easy' solution.
> I'd say not holding to the standards of robots.txt and 403-Forbidden is quite malicious
Most hidden services don't publish robots files. The only ones that do are the proxy services (which are hidden services but not usually 'hidden'). The purpose of the proxying is to find, discover and monitor what are usually illegal or malicious services.
I don't think there are legitimate crawlers on hidden services - there are a couple of drug market search engines but they identify themselves outside of robots.txt
It's really difficult to run a large-scale hidden service because of this - you need to be able to throttle or block connections but not based on the inbound circuit. You also need to set up guards (which OP makes no mention of).
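As a rough sketch of throttling that doesn't depend on knowing who the client is, one option is a single service-wide token bucket. This is only one possible approach and an assumption on my part, not something OP or Tor provides; the `TokenBucket` class, its parameters and the numbers below are hypothetical.

```python
import time


class TokenBucket:
    """Service-wide rate limiter: one shared bucket, no per-client identity."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock        # injectable so the behaviour can be tested
        self.tokens = capacity
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend tokens if the request may proceed."""
        now = self.clock()
        # Refill according to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Hypothetical limits: 5 requests per second with bursts of up to 10.
bucket = TokenBucket(rate=5.0, capacity=10.0)
if not bucket.allow():
    print("over budget: reject or delay this request")
```

The obvious downside is that a shared bucket slows down legitimate visitors along with the crawler, which is part of why this is hard to do well.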
> It overloads the network, doesn't crawl and doesn't parse the responses.
It's likely adding those later responses into a crawl queue that is tens of thousands of URLs long.
Overloading the network is unintentional; usually your crawling is throttled by your circuit.
> I'd say not holding to the standards of robots.txt and 403-Forbidden is quite malicious, just not evil or bad. If you build a crawler, you should play nice.
> A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.