Anna! I just want to say, I love you. Everything about what you’re doing is heroic. Whoever and wherever you are, thank you.
Please focus on your opsec. The more visible you become, the angrier people will get. Don’t do anything silly like edit your Wikipedia page from your house.
With that out of the way, someone I know happens to have the original books3 epub files. I think they can be convinced to send them to you. It’s only 200,000 books, but that could theoretically grow your collection by 10% or so. I don’t know whether that would be helpful to you (you’ve far surpassed books3 at this point), but if so, let me know.
Given the legal risks, the best course of action for AI companies is probably to ignore English and European books entirely. There is plenty of Chinese data, and the models would learn all the same concepts without exposing anyone to lawsuits.
Since you're pretty knowledgeable about these things, I think I should ask here: I've made a fairly simple design for a BitTorrent-based program that will allow people to "donate" their disk space to organizations like archive.org, Anna's Archive, and anything else that needs data hosted.
Basically, you download a client, say "allocate 2 TB of my disks to whatever archive.org/donate/disk.rss says", and the server/client combination ensures you download and seed the rarest 2 TB of the collection.
This design is also open, in the sense that the server can share the database of torrents it contains, and anyone can use it to fetch any of the files in the dataset from the swarm.
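Roughly what I have in mind for the client side, as a sketch rather than a spec (the feed URL, the JSON field names, and the "seeders" count are placeholders I made up; the real feed would presumably be the RSS mentioned above):

```python
# Sketch of the client-side allocation logic: fetch the archive's feed,
# sort by rarity, and fill the donated budget with the least-seeded items.
# The endpoint and schema below are hypothetical.
import json
import urllib.request

FEED_URL = "https://example.org/donate/disk.json"  # placeholder endpoint
BUDGET_BYTES = 2 * 1024**4                         # "donate 2 TB"

def pick_rarest(feed_url: str, budget: int) -> list[dict]:
    """Fill the donated budget with the least-replicated torrents first."""
    with urllib.request.urlopen(feed_url) as resp:
        items = json.load(resp)  # assumed: [{"infohash": ..., "size": ..., "seeders": ...}, ...]
    items.sort(key=lambda t: t["seeders"])  # rarest first
    chosen, used = [], 0
    for t in items:
        if used + t["size"] <= budget:
            chosen.append(t)
            used += t["size"]
    return chosen

if __name__ == "__main__":
    for t in pick_rarest(FEED_URL, BUDGET_BYTES):
        print(t["infohash"], t["size"])
```

The actual fetching and seeding would be handed off to an ordinary BitTorrent client; this layer only decides what each donor should hold.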
Would something like this be at all useful? I've emailed a few archivists, but I got no response, and the one person I've managed to talk to about this said there have been a few attempts on this, but they always fail for one reason or another.
You are literally building what I’ve been slowly working towards on my own. This seems like a very good sign. Multiple simultaneous discovery is a common occurrence in the sciences.
The hard part is that those who donate their space have authority over that space. It’s the Byzantine fault tolerance problem: imagine if 4chan donated their space, then started serving CSAM instead of the expected data. You can use hashes to verify integrity, but then the question becomes who gets to decide which hashes are ok. And hashing makes editing large files painful, since any change produces a new hash and effectively a new blob to distribute; editing is a frequent occurrence in LLM work. You’re constantly tweaking your datasets and spitting out new blobs.
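For the integrity half, BitTorrent piece hashes already give you this; just to make the verification step concrete (the piece size and the published digest list are assumptions):

```python
# Verify downloaded data against a trusted list of per-piece digests published
# by the coordinating server, so a hostile donor can't serve altered content.
import hashlib

PIECE_SIZE = 1 << 20  # 1 MiB pieces, an assumed value

def verify_pieces(path: str, expected_digests: list[str]) -> bool:
    """Reject data whose pieces don't match the published hashes."""
    with open(path, "rb") as f:
        for i, expected in enumerate(expected_digests):
            piece = f.read(PIECE_SIZE)
            if hashlib.sha256(piece).hexdigest() != expected:
                print(f"piece {i} failed verification; discard and re-fetch")
                return False
    return True
```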
Direct answer: yes, you’re doing good work, and you should keep doing it. I would personally use this for storing books3 transformations.
The other hard part is that you’ll want at least some redundancy — see MIT’s 6.824 distributed systems course, or the GFS paper. It’s why I’ve been implementing Raft and toying with some kind of distributed consensus without a blockchain. (Such consensus is still possible if the researchers are granted authority over what can be stored — which is the whole reason people are donating their disk space in the first place.)
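A toy sketch of what the coordinator’s side of that redundancy bookkeeping could look like (the replication factor and the data structures are assumptions, not a design):

```python
# Track which donors hold which item and flag anything under-replicated,
# so new donors get assigned the most fragile data first.
TARGET_REPLICAS = 3  # assumed replication factor, in the spirit of GFS

def under_replicated(assignments: dict[str, set[str]]) -> list[str]:
    """assignments maps item_id -> set of donor node ids currently seeding it."""
    return [item for item, donors in assignments.items()
            if len(donors) < TARGET_REPLICAS]

# Example: "b3" has only one live replica, so it gets pushed to new donors first.
state = {"b1": {"nodeA", "nodeB", "nodeC"},
         "b2": {"nodeA", "nodeD", "nodeE"},
         "b3": {"nodeF"}}
print(under_replicated(state))  # ['b3']
```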
Another issue is sudden bandwidth loss. Data storage is one half of the problem; the other half is rapid transfer. By replicating the data, you can pull it from multiple replicas at once (i.e. there are more seeders). This also protects against someone suddenly getting throttled, or just having a power outage. The protocol should prioritize donors with high bandwidth over vast storage space.
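For instance, a toy scoring rule along those lines (the weights are completely made up):

```python
# Rank donors for new assignments by favouring measured upload bandwidth
# over raw storage capacity; fast nodes get the hot/rare items first.
def donor_score(upload_mbps: float, storage_tb: float) -> float:
    return 10.0 * upload_mbps + 1.0 * storage_tb

donors = [("slow-but-huge", 5.0, 40.0), ("fast-but-small", 500.0, 2.0)]
ranked = sorted(donors, key=lambda d: donor_score(d[1], d[2]), reverse=True)
print([name for name, *_ in ranked])  # ['fast-but-small', 'slow-but-huge']
```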
Feel free to DM me on Twitter if you’d like to toss around some design ideas more seriously, and thank you for trying to build this.
> The protocol should prioritize donors with high bandwidth over vast storage space.
If you're doing this over BitTorrent then you might want a client that's configured to optimize for a different goal than most torrent clients.
Potential goals, somewhat conflicting:
A) Keep data with a low mirroring degree available. Either this needs to be centrally coordinated, or you need some sort of randomized algorithm where clients pick underseeded torrents without everyone picking the same ones.
B) Bandwidth matching. To avoid consuming more resources than it provides, a client should perhaps only download 1 piece of data for every N pieces it has uploaded (a rough sketch follows below). This is much less greedy than what you'd have in a normal torrent client, but it ensures that the caches themselves don't take up much bandwidth compared to users who actually want to download something. Otherwise a misconfigured cache (e.g. one stuck behind NAT) could accidentally keep downloading data without ever giving much back.
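A rough sketch of that gate; the bootstrap credit is my own addition, since a brand-new cache otherwise has nothing to upload yet and could never start:

```python
# Allow 1 downloaded piece per N uploaded pieces, plus a small bootstrap credit.
# A real client would hook this into its piece scheduler; here it's just a counter.
class RatioGate:
    def __init__(self, n: int = 4, bootstrap_pieces: int = 64):
        self.n = n                      # require N uploads per download
        self.credit = bootstrap_pieces  # lets a fresh cache fetch something to seed at all
        self.uploaded = 0
        self.downloaded = 0

    def record_upload(self) -> None:
        self.uploaded += 1

    def may_download(self) -> bool:
        # Only fetch the next piece if past uploads have "paid" for it.
        return self.downloaded < self.credit + self.uploaded // self.n

    def record_download(self) -> None:
        self.downloaded += 1
```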
Thank you, that’s helpful to know. And frustrating. I see why Twitter did that, because bots, but I was willing to wade through the crap to find the gems. Which do get sent.
I responded with some telegram info if that helps.
Not exactly. IPFS doesn't tell you what to download (you select what to download), and thus can't push the rarest material to you. There are many similarities, but this design is much better suited to making large datasets resilient and accessible.
Did some rabbit hole spelunking on IPFS yesterday...
One could easily use IPFS as the storage layer and add $magic_sauce to manage the distribution of the books within the sub-network, kind of like git-annex does to manage people's porn collections. There's one project I saw that does this (using the IPFS daemon's RPC API) to run a cluster of IPFS nodes, for whatever reasons people would want to do something like that.
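For instance, each node could pin its assigned share through the local Kubo daemon's HTTP RPC API; the coordinator and the CID list are hand-waved here, but /api/v0/pin/add is a real endpoint:

```python
# Ask the local IPFS (Kubo) daemon to pin a CID, which makes it fetch and keep
# that content. The list of CIDs assigned to this node is a placeholder.
import json
import urllib.request

IPFS_API = "http://127.0.0.1:5001/api/v0"

def pin(cid: str) -> dict:
    req = urllib.request.Request(f"{IPFS_API}/pin/add?arg={cid}", method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

for cid in ["QmExampleCid1", "QmExampleCid2"]:  # placeholder CIDs from the coordinator
    print(pin(cid))
```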
Yes please, put us in touch by email. Or feel free to email me yourself and we can set up more secure comms from there. Thanks so much for everything you are doing as well!