I think once your checkout is complete, you're still distributed as far as being able to make and push commits that modify the files you have. But obviously someone can't clone files from you that you don't yourself have.
From reading the first post, "The largest Git repo on the planet" ("Windows code base" of "3.5M files", "repo of about 300GB", "4,000 engineers"), I assumed that the "Windows code base" contains all the utilities and DLLs that are included in a Windows release, such as Notepad, games, IE/Edge, etc., and not just the Windows kernel.
If so, the comparable code bases to check against are Android AOSP, Ubuntu, Red Hat, or FreeBSD.
If that is true, I believe the source code base for an Ubuntu/Red Hat distribution with all of its apps is likely bigger than Windows in terms of number of files, repo size, and number of engineers (open source developers for all the packages such as Firefox, Chrome, OpenOffice).
Microsoft folks feel free to correct me here.
It seems that the existing git process and dev model already works well for much bigger projects by using a separate git repo for each app.
Still not sure what pain point the new GVFS solves...
AOSP is probably the closest comparison; Fedora/RHEL and Ubuntu don't keep application source code checked into git. Fedora specifically uses dist-git: there's a git repo per package with the spec file, patches, and other files needed for the build, and source tarballs are pushed to a web server where they can be downloaded later with the dist-git tooling.
So yeah, all of the code and data actually stored in source control for various Linux distributions is pretty small.
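For anyone curious, the dist-git side looks roughly like this (the package name is just an example, and I'm going from memory on the exact fedpkg subcommands):

    # Clone the per-package dist-git repo: it holds the .spec file, the distro's
    # patches, and a "sources" file listing tarball checksums, not the upstream code.
    fedpkg clone bash
    cd bash

    # Fetch the actual source tarballs referenced by "sources" from the
    # lookaside cache (a plain web server).
    fedpkg sources

    # Build a source RPM locally from the spec + patches + downloaded tarballs.
    fedpkg srpm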
Linux distributions are a different beast since they're mostly tracking a myriad of upstream third-party repositories overlaid when necessary by their own patch sets. And most of those upstream repos are by necessity highly decoupled from any particular distribution. Different distributions have different solutions to how they manage upstreams with local changes. E.g. OpenEmbedded uses BitBake where (similar to FreeBSD Ports and Gentoo Portage) the upstream can be pretty much anything, including just a tarball over HTTP, and local changes are captured in patch files, while others instead use one-to-one tracking repos where local changes are represented by version control revisions.
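As a made-up illustration of that tarball-plus-patch-queue model, the fetch/patch step is essentially (URL and paths are hypothetical):

    # Upstream is just a tarball over HTTP...
    curl -LO https://example.org/releases/foo-1.2.tar.gz
    tar xf foo-1.2.tar.gz
    cd foo-1.2

    # ...and the distro's local changes live in a directory of patch files
    # that get applied on top of the unpacked source.
    for p in ../patches/*.patch; do
        patch -p1 < "$p"
    done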
AOSP is closer to what you're imagining, but I haven't met anyone who thinks Repo (Android's meta-repo layer) and Gerrit (their Repo-aware code review and merge queue tool) are pleasant to work with. E.g. it takes forever and a day to do a Repo sync on a fresh machine. A demand-synced VFS would be very nice for AOSP development, even though it's not a monorepo but a polyrepo where Repo ties everything together.
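For reference, a fresh AOSP checkout looks something like this (the branch name is just an example), and the sync is the part that takes forever because it clones the hundreds of git repos listed in the manifest:

    mkdir aosp && cd aosp
    # Point Repo at the manifest repo, which lists all the constituent git repos.
    repo init -u https://android.googlesource.com/platform/manifest -b android-7.1.1_r1
    # Clone/sync every repo in the manifest; -c limits to the current branch,
    # -j runs fetches in parallel. This is the multi-hour step on a fresh machine.
    repo sync -c -j8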
It's more that it provides more control on the centralized<->distributed spectrum. Git by default, yes, is fully distributed, with every copy of a repo supposed to maintain full history, etc. GVFS gives you the option to offload some of that storage effort to servers you specify. Those servers can themselves be distributed (similar to CDNs, etc.), so there's still distribution flexibility there.
You can think of it as giving you somewhat flexible control over the "leecher/seeder ratio" in BitTorrent: how many complete copies of the repo are available/accessible at a given time.
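Roughly, based on the announcement (exact commands are from memory and may be off; the server URL is hypothetical, and GVFS at this point only runs on Windows):

    :: Set up a virtualized clone backed by a GVFS-enabled server;
    :: only metadata comes down now, not every blob and pack.
    gvfs clone https://dev.example.com/_git/big-repo C:\src\big-repo
    cd C:\src\big-repo\src

    :: Ordinary git commands work from here; file contents and missing
    :: objects are faulted in from the server (or a nearby cache) on demand.
    git checkout -b my-feature
    git status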
I've been thinking of it as more of making it on-demand: it doesn't sound like you lose the underlying distributed nature (e.g. from the description on https://blogs.msdn.microsoft.com/bharry/2017/02/03/scaling-g... it sounds like 'git push ssh://github.com/microsoft/windows.git' would work, although it might take longer to run than it does for you to be fired), but managed clients can choose to operate in a mode where they download content on demand rather than everything up front.
That seems like a reasonable compromise for a workload which Git is otherwise completely unable to handle.
It’s just a pragmatic recognition that most files are going to stay unchanged relative to some “master” repository anyway. It maintains the same distributed semantics as vanilla git, but allows you to use less disk space if you choose to rely on the availability of that master repository.
It's worth asking: how many Github users are actually using git as a fully distributed version control system? The typical Github workflow is to treat the Github repo as a preferred upstream--which sort of centralizes things.
I think a common open source workflow is to have a forked GitHub repo and work off of that.
So in the end you have your local repo, your fork, and the organization repo. During the pull request process, third parties might make pull requests against your fork to try and fix things.
The typical Github workflow is to treat the Github repo as a preferred upstream
Is it typical? What metrics do you have to support that?
Speaking for myself I have only ever used triangular workflows: fork upstream; set local remotes to own fork; push to own fork; issue pull request; profit
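In concrete terms, with placeholder names, that triangular setup is just:

    # Clone your own fork; "origin" is the remote you can push to.
    git clone git@github.com:me/project.git
    cd project

    # Add the canonical repo as a read-only "upstream" remote.
    git remote add upstream https://github.com/someorg/project.git
    git fetch upstream

    # Branch from upstream, do the work, push to your fork,
    # then open a pull request against upstream through the web UI.
    git checkout -b my-fix upstream/master
    git push -u origin my-fix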
The main repo is upstream of you regardless of whether it is labeled as such in `git remote -v`. If it goes offline, nobody systematically falls back on your fork. This makes the system effectively centralized, which is parent's point.
To elaborate in more (possibly excruciating) detail:
It's very rare for anyone on GitHub to do the sort of tiered collaboration that the Linux kernel uses. If, say, I want to contribute to VSCode, pretty much the only way to get my changes upstream is to submit pull requests directly to github.com/Microsoft/vscode.
Compare to the tiered approach, where I notice someone is an active and "trusted" contributor, so I submit my changes to their fork, they accept them, and then those changes eventually make their way into the canonical repo at some future merge point. That's virtually unheard of on GitHub, but it's the way the Linux kernel works.
Pretty much the only way you could get away with something even remotely similar and not have people look at you funny in the GitHub community is if you stalked someone's fork, noticed they were working on a certain feature, then noticed there was some bug or deficiency in their topic branch, and they woke up in the morning to a request for review from you regarding the fix. Even that, which would be very unusual, would really only work in a limited set of cases where you're collaborating on something they've already undertaken; there's not really a clean way in the social climate surrounding GitHub to submit your own, unrelated work to that person (e.g., because it's something you think they'd be interested in), get them to pull from you, and then get upstream to eventually pull from that person.
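For contrast, the kernel-style flow is built around `git request-pull`: you publish a branch somewhere the next person up the chain can fetch from and mail them a pull summary (the repo URL, tag, and branch names below are made up):

    # Publish the finished topic branch to a repo the maintainer can reach.
    git push https://git.example.org/~me/project.git my-topic

    # Generate a summary (diffstat + shortlog) of everything since the base tag,
    # to be mailed to the maintainer, who pulls it and is in turn pulled from later.
    git request-pull v4.10 https://git.example.org/~me/project.git my-topic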
Good clarification. But, do we know that is what everyone does? It's obviously a cultural, rather than a technical limitation. I suspect there are significant bodies of code kept inside corporate forks of upstream (and regularly rebased to them) with only selected parts dribbled out to the public upstream repos by trusted representatives of said copies. But, I have nothing to prove that and the only public traces I would see would be commits to upstream from corporation X.
Depends on what you mean by distributed. If a repo contains 2 projects pA and pB and I’m only involved in the former while the other project has a 2GB binary asset with 1000 historical revisions, then I’m happy to just be distributed wrt my part of the repo.
To put it another way: A server with multiple repos on it is a central server with respect to the repos you don’t clone! This is the same, but for parts of repos.
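Stock git can already trim the working tree to just your part of the repo with a sparse checkout (paths below are hypothetical), but note that it still downloads the full history, 2GB binaries and all; avoiding that download is the part GVFS adds:

    git clone --no-checkout https://git.example.org/monorepo.git
    cd monorepo

    # Limit the working tree to project pA only; pB's files never get checked out,
    # but their objects are still in .git.
    git config core.sparseCheckout true
    echo "pA/" > .git/info/sparse-checkout
    git checkout master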