Interesting.

IIRC, libgen used IPFS for preservation efforts.

Anna's Archive (seemingly the successor) appears to have migrated to BitTorrent.

I wonder what motivated the move?

Edit: asking as someone who works daily on building p2p software. We've abandoned mainline BitSwap (IPFS) in our work for similar reasons as the rest of the rust-libp2p community, but haven't found a particularly good "successor" protocol for a generalized use case yet. We are currently using our own ad-hoc hand-rolled chunking/transfer protocol as needed.


Anna here.

Libgen still uses torrents primarily for preservation. It also hosts on IPFS but that is more for access, and there are very few IPFS seeders.

We tried IPFS for a bit but found it not stable or usable enough for preservation purposes. We're closely watching IPFS development and hope it gets there, since it would be wonderful to merge the preservation and access use cases into one system.


That's wild (in a good/interesting way) to me.

I've found the BitTorrent protocol to be geared more toward accessing popular data on demand (e.g. streaming a popular file) than toward archival.

IPFS' BitSwap protocol strikes me as aiming at longer-term preservation (higher time to first byte in exchange for more resilient pinning/discovery/propagation of rare data).

It's cool you're observing the opposite. I've had a growing suspicion that both protocols haven't quite realized the benefits they were hoping to get from the trade-offs they made in their transfer/discovery protocols.

Would love to compare notes at some point if you'd be open to it.

We've been playing around with both BitTorrent and IPFS. Some of the datasets we are working towards supporting are approaching the scale you work at (100TB archives).

Ultimately both BitTorrent and IPFS have fallen short for me when trying to seed 100TB datasets.

I've got a hunch that tackling these larger datasets is going to need a new protocol that merges some of HTTP's, BitTorrent's, and IPFS' approaches to sharing content.

I have a personal R&D list for pushing a file sharing protocol past the 100TB limit (not in any particular order):

* Better chunking using a mix of:

  * Rolling hashes

  * File boundary splitting

  * (should enable deduplication of identical files across nonhomogeneous archives, and allow for adding content to an archive without losing the existing seeders; rough chunker sketch after the list)

  * (inspired by prior art in container storage: https://github.com/hinshun/ipcs)
* "online" deterministic archive formats w/ detached metadata (manifest sketch after the list)

  * Ability to share a directory as an archive, or a partial slice of an archive, without having to generate the archive on disk. (Announce a "tarball"-like archive on the DHT without ever materializing it, by generating its "chunks" on demand from the directory.)

  * Detach the manifest containing the archive's contents from the archive, so you can download/parse the manifest without downloading the full dataset. (You can use this to find the chunks specific files are in. So you can download a single file from a 1TB archive, and the client can seed that file back to the network as part of the archive.)

  * Chunking of manifest files for large datasets, since the manifest itself might grow to many GBs in size (manifest resolution inspired by IPLD's data structure)

  * Normalize file metadata in the archive header so timestamps etc. don't muck up your CIDs

  * Deterministic ordering of files in an archive
* Chunking/Transfers/Announcing/Discovering

  * Supporting chunk sizes past 1MB for large files. A 100TB dataset w/ 1MB chunks is still roughly 100M CIDs just for the chunks (100 TiB / 1 MiB ≈ 105M); that's a lot of load on the DHT and a lot of work on the seeding node to keep the data available.

  * Support interruptible/resumable/recoverable downloads from peers using something similar to HTTP Range header semantics (resume-bookkeeping sketch after the list)

  * Merge BitTorrent's DHT query approach w/ IPFS' DHT query approach: ask connected peers for CIDs with tit-for-tat reciprocity, while simultaneously hedging your bets by kicking off the slower DHT traversal to find more peers (hedged-lookup sketch after the list)
* Connectivity

  * Bringing mobile devices and browser tabs into the fold as first class peers that can both download and seed content

  * (i.e. WebRTC: https://github.com/libp2p/rust-libp2p/tree/master/examples/browser-webrtc)

  * (proof-of-concept NAT hole punching appliance for end-users: https://github.com/retrohacker/turn-it-up)
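
To make the chunking item above concrete, here's a minimal content-defined chunker in Rust (std only), in the spirit of gear-hash/FastCDC-style rolling hashes. The window-free gear hash, the 256 KiB / ~1 MiB / 4 MiB bounds, and the splitmix64-style table are illustrative choices, not anyone's actual scheme. Run per file ("file boundary splitting") rather than over a concatenated archive, identical files in different archives produce identical chunk sets, which is what makes cross-archive dedup possible.

    const MIN_CHUNK: usize = 256 * 1024;       // never cut below 256 KiB
    const AVG_MASK: u64 = (1 << 20) - 1;       // low 20 bits == 0 => ~1 MiB average spacing between cuts
    const MAX_CHUNK: usize = 4 * 1024 * 1024;  // force a cut at 4 MiB

    // Deterministic per-byte "gear" value, built with a splitmix64-style mixer so the
    // sketch stays self-contained (a real implementation would ship a fixed random table).
    fn gear(b: u8) -> u64 {
        let mut x = (b as u64).wrapping_add(0x9E3779B97F4A7C15);
        x = (x ^ (x >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        x = (x ^ (x >> 27)).wrapping_mul(0x94D049BB133111EB);
        x ^ (x >> 31)
    }

    /// Returns the end offsets of each chunk in `data`. The hash is updated per byte;
    /// shifting left means a byte's influence ages out after 64 steps, so boundaries
    /// depend only on local content. An insertion near the start of a file therefore
    /// only re-chunks the region around the edit.
    fn chunk_boundaries(data: &[u8]) -> Vec<usize> {
        let mut cuts = Vec::new();
        let mut hash: u64 = 0;
        let mut start: usize = 0;
        for (i, &b) in data.iter().enumerate() {
            hash = (hash << 1).wrapping_add(gear(b));
            let len = i + 1 - start;
            if (len >= MIN_CHUNK && (hash & AVG_MASK) == 0) || len >= MAX_CHUNK {
                cuts.push(i + 1);
                start = i + 1;
                hash = 0;
            }
        }
        if start < data.len() {
            cuts.push(data.len());
        }
        cuts
    }

    fn main() {
        // Fill a 10 MiB buffer with pseudo-random bytes so the cut points come from
        // the rolling hash rather than the max-size fallback.
        let mut data = vec![0u8; 10 * 1024 * 1024];
        let mut state: u64 = 42;
        for byte in data.iter_mut() {
            state = state.wrapping_mul(6364136223846793005).wrapping_add(1);
            *byte = (state >> 56) as u8;
        }
        println!("{} chunks", chunk_boundaries(&data).len());
    }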
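
For the detached-manifest item, here's a sketch of what such a manifest could look like as a data structure. The field names, the normalized mode values, and the bytewise path ordering are assumptions for illustration, not an existing format; the point is that a client can fetch only the manifest, pick one file out of a multi-TB archive, download just its chunks, and still seed them back under the archive's identity.

    /// Content identifier for a chunk; here just a 32-byte hash (illustrative).
    type Cid = [u8; 32];

    struct ManifestEntry {
        /// Path relative to the archive root, separators normalized to '/'.
        path: String,
        /// File size in bytes.
        size: u64,
        /// Unix mode normalized to a small set (e.g. 0o644 / 0o755) so incidental
        /// metadata doesn't change the archive's identity. Timestamps are deliberately
        /// absent for the same reason.
        mode: u32,
        /// Chunks backing this file, in order.
        chunks: Vec<Cid>,
    }

    struct Manifest {
        /// Entries sorted by `path` (bytewise), giving deterministic ordering.
        entries: Vec<ManifestEntry>,
    }

    impl Manifest {
        /// Canonicalize: sort entries so the same directory always produces the same
        /// manifest bytes, and hence the same archive CID.
        fn canonicalize(&mut self) {
            self.entries.sort_by(|a, b| a.path.cmp(&b.path));
        }

        /// Find the chunk list for a single file, enabling partial downloads.
        fn chunks_for(&self, path: &str) -> Option<&[Cid]> {
            self.entries
                .iter()
                .find(|e| e.path == path)
                .map(|e| e.chunks.as_slice())
        }
    }

    fn main() {
        let mut m = Manifest {
            entries: vec![
                ManifestEntry { path: "books/b.pdf".into(), size: 7, mode: 0o644, chunks: vec![[1u8; 32]] },
                ManifestEntry { path: "books/a.pdf".into(), size: 7, mode: 0o644, chunks: vec![[2u8; 32]] },
            ],
        };
        m.canonicalize();
        for e in &m.entries {
            println!("{} ({} bytes, mode {:o}, {} chunk(s))", e.path, e.size, e.mode, e.chunks.len());
        }
        assert!(m.chunks_for("books/a.pdf").is_some());
    }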
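
For the resumable-download item, a sketch of the bookkeeping a client could keep so an interrupted chunk transfer resumes with Range-style requests for only the missing bytes. The types are hypothetical and overlap handling is deliberately simplified.

    #[derive(Debug, Clone, Copy)]
    struct ByteRange {
        start: u64, // inclusive
        end: u64,   // exclusive
    }

    struct ChunkDownload {
        size: u64,
        have: Vec<ByteRange>, // non-overlapping, sorted by start
    }

    impl ChunkDownload {
        fn new(size: u64) -> Self {
            Self { size, have: Vec::new() }
        }

        /// Record a completed range (assumed not to overlap existing ones in this sketch).
        fn mark_received(&mut self, r: ByteRange) {
            self.have.push(r);
            self.have.sort_by_key(|r| r.start);
        }

        /// Ranges still missing: these become the "Range: bytes=start-end"-style requests
        /// sent to the next peer after an interruption.
        fn missing(&self) -> Vec<ByteRange> {
            let mut gaps = Vec::new();
            let mut cursor = 0;
            for r in &self.have {
                if r.start > cursor {
                    gaps.push(ByteRange { start: cursor, end: r.start });
                }
                cursor = cursor.max(r.end);
            }
            if cursor < self.size {
                gaps.push(ByteRange { start: cursor, end: self.size });
            }
            gaps
        }
    }

    fn main() {
        let mut dl = ChunkDownload::new(4 * 1024 * 1024);
        dl.mark_received(ByteRange { start: 0, end: 1_000_000 });         // first peer dropped here
        dl.mark_received(ByteRange { start: 3_000_000, end: 4_194_304 }); // tail came from another peer
        println!("still need: {:?}", dl.missing());
    }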
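
And for the merged DHT query item, a toy of the hedged lookup: ask already-connected peers immediately while the slower DHT traversal runs in parallel, and start downloading from whoever answers first. Threads and sleeps stand in for network calls here; a real client would use async tasks (e.g. on top of libp2p), and the names are made up.

    use std::sync::mpsc;
    use std::thread;
    use std::time::Duration;

    #[derive(Debug)]
    enum Found {
        ConnectedPeer(String),
        DhtPeer(String),
    }

    fn main() {
        let (tx, rx) = mpsc::channel();

        // Fast path: connected peers usually answer in one round trip.
        let tx1 = tx.clone();
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(20)); // pretend RTT
            let _ = tx1.send(Found::ConnectedPeer("peer-A".into()));
        });

        // Slow path: iterative DHT traversal, kicked off at the same time rather than
        // only after the fast path comes up empty.
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(300)); // pretend multi-hop walk
            let _ = tx.send(Found::DhtPeer("peer-B".into()));
        });

        // Start fetching from whoever answers first, and keep draining the channel to
        // learn about extra providers for parallel chunk fetches / tit-for-tat later.
        for provider in rx.iter().take(2) {
            println!("provider found: {:?}", provider);
        }
    }
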
Thank you for everything you do


I suspect it's partly the required capacity. This project is far beyond what even the largest private trackers could host, but if anyone comes even close to being able to keep this alive when the copyright mafia comes for it, it's the torrent community.


Nexus - another very large archive - is using IPFS. But in my experience BitTorrent works a lot better at this scale. The IPFS UX is full of papercuts when it isn't outright bugging out or crumbling under the size of the dataset.


IIRC, Nexus is using Iroh[0] instead:

> Starting with v0.3.0, Iroh is a ground-up reimagination of the InterPlanetary File System (IPFS) focused on performance.

Also see A New Direction for Iroh[1].

[0] https://www.iroh.computer/docs/

[1] https://www.n0.computer/blog/a-new-direction-for-iroh/


I'm guessing the decision comes down to ease of use for people participating in mirroring. My understanding is that IPFS tends to require more infrastructure, and still requires someone to pin the data.

Many BitTorrent clients let you click a button to keep seeding the data over time.

