
My impression was that peers are generating the deltas on-the-fly based on which commits the requesting peer states it needs. The problem's therefore shoved onto git itself, with the seeder just cherry-picking a specific range of commits from its own copy of the repo and bundling them together.



Ah, okay, but that would imply that GitTorrent doesn't make any use of the swarming capability which makes BitTorrent special.

Not using swarming brings back all of the old problems of NAT traversal, asymmetric upload/download bandwidth, throttling, censorship etc.


Not quite: the peer generates the pack and tells you its hash, and then you query the network for anyone who has that hash (them, for starters), and perform a swarming download of it. So git clones of popular repositories would usually swarm.


The probability of swarming would be influenced by multiple factors, e.g.:

* Higher popularity => More peers => Higher probability that multiple peers want the same packfiles.

* Higher popularity => More commits => More permutations of packfiles => Lower probability that multiple peers want the same packfiles (and stronger trends toward small/inefficient packfiles).

* More frequent synchronizations (peers always online) => More immediacy => Smaller packfiles => Higher probability that multiple peers want the same packfiles.

* Less frequent synchronizations (peers go offline regularly) => Less immediacy => Bigger packfiles => Lower probability that multiple peers want the same packfiles.

It would be really interesting to see how these competing pressures play out (either with some math or with randomized experiments).
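As a starting point for such an experiment, here is a toy Monte Carlo sketch of the tradeoff described above. It models packfile identity by the requester's base commit only (same base + same HEAD = same packfile), which makes this a birthday-collision problem; the function name and the model itself are hypothetical simplifications, not anything from gittorrent.

```python
import random

def p_shared_packfile(n_peers, commits_behind_max, trials=10_000):
    """Estimate the chance that at least two of n_peers request the
    identical delta packfile, assuming each peer's base commit is
    uniformly distributed over the last commits_behind_max commits."""
    hits = 0
    for _ in range(trials):
        bases = [random.randrange(commits_behind_max) for _ in range(n_peers)]
        if len(set(bases)) < n_peers:  # some base commit is shared
            hits += 1
    return hits / trials

# More peers raises the collision chance; more candidate base commits
# (a busy repo with stale peers) lowers it.
print(p_shared_packfile(n_peers=20, commits_behind_max=50))
print(p_shared_packfile(n_peers=20, commits_behind_max=5000))
```

Under this model the "more commits => more permutations" pressure dominates quickly: spreading the same 20 peers over 100x more base commits collapses the collision probability.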

If the main goal here is strictly decentralization (without concern for performance or availability[F1]), then one might view swarming as a nice-to-have behavior that only emerges under favorable circumstances. However, by latching onto the "torrent" brand, I think you set up expectations for swarming/performance/availability.

([F1] Availability: If Seed-1 recommends a certain packfile, then the only peer which is guaranteed to have that packfile is Seed-1 -- even if there are many seeds with a full git history. If Seed-1 goes offline while transmitting that packfile, how could a leech continue the download from Seed-2? The #seeds wouldn't intuitively describe the reliability of the swarm... unless one adds some special-case logic to recover from unresolvable packfiles.)

---

Could this be mitigated with some constraints on how peers delineate packfiles?


> Could this be mitigated with some constraints on how peers delineate packfiles?

YAGNI.

Like so many here, you have a single view of how bittorrent should be used, based on current filesharing practices, so you believe we need to map gittorrent onto filesharing and keep those packfiles as static as possible so they can be widely shared.

You need to go back to the root of the problem, which is simple: there is a resource you're interested in, and instead of fetching it from a single machine and clogging its DSL line, you want to fetch it from as many machines as possible to make better use of the network.

How does gittorrent work?

- The project owner commits and updates a special key in the DHT that says "for this repo, HEAD is currently at 5fbfea8de70ddc686dafdd24b690893f98eb9475"

- You're interested in said repo, so you query the DHT and you know that HEAD is at 5fbfea8de70ddc686dafdd24b690893f98eb9475

- Now you ask each peer that has 5fbfea8de70ddc686dafdd24b690893f98eb9475 for its content

- Each peer builds the diff packfile and sends it through bittorrent. Technically it's another swarm with another infohash, but you don't care; it's only ephemeral anyway. The real swarm is 5fbfea8de70ddc686dafdd24b690893f98eb9475.

Because of this, higher popularity means more peers in the swarm, whatever the actual packfile being exchanged is. Bittorrent as you know it is not used as-is, because gittorrent-specific information helps make better use of it.
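The four steps above can be mocked up in a few lines. This is only an illustrative sketch (a dict standing in for the DHT, commit lists standing in for repos, SHA-1 of the commit list standing in for the infohash); none of the names are the real gittorrent API.

```python
import hashlib

dht = {}  # toy DHT: repo key -> published HEAD

def publish_head(repo_key, head_sha):
    """Step 1: the owner announces the new HEAD under a well-known key."""
    dht[repo_key] = head_sha

def build_delta_pack(seeder_commits, have_sha, want_sha):
    """Steps 3-4: a stand-in for `git pack-objects`. Bundle the commits
    between the leech's HEAD and the published HEAD, and derive an
    ephemeral infohash for that one-off delta swarm."""
    i = seeder_commits.index(have_sha)
    j = seeder_commits.index(want_sha)
    delta = seeder_commits[i + 1 : j + 1]
    infohash = hashlib.sha1(" ".join(delta).encode()).hexdigest()
    return infohash, delta

publish_head("myrepo", "c3")
head = dht["myrepo"]                 # step 2: leech looks up HEAD
seeder = ["c1", "c2", "c3"]          # a peer announcing it has "c3"
print(build_delta_pack(seeder, have_sha="c1", want_sha=head))
```

The "real" swarm key here is the HEAD sha published in the DHT; the per-delta infohash is ephemeral, exactly as the comment describes.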


(Author here.)

Great comment, thank you. But I think the infohash should actually be shared: packfiles are pretty deterministic in practice. So you'd be getting the diff packfile from the person who just made it, and from anyone else who already did.

(If I find packfile generation not to be deterministic enough, I think I'll switch to a custom packfile generation scheme that is always deterministic.)
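The determinism requirement can be illustrated with a toy "canonical packing" scheme: if every peer sorts the object set the same way before compressing, byte-identical output (and therefore an identical infohash) falls out for free. This is not git's real pack format, just a sketch of the property the author wants.

```python
import hashlib
import zlib

def canonical_pack(objects):
    """Toy deterministic packing: sort (object_id, data) pairs so every
    peer emits byte-identical bytes for the same object set, then hash
    the result as its 'infohash'. Not the real git pack format."""
    body = b"".join(zlib.compress(data, 9) for _, data in sorted(objects))
    return hashlib.sha1(body).hexdigest()

# Two peers holding the same objects in different internal orders
# still agree on the infohash, so their swarms merge.
peer_a = [("b2", b"second blob"), ("b1", b"first blob")]
peer_b = [("b1", b"first blob"), ("b2", b"second blob")]
print(canonical_pack(peer_a) == canonical_pack(peer_b))
```

In real git, pack output can vary with delta-search ordering and threading, which is why a custom always-deterministic generator is a plausible fallback.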



