Even working at Google, my jaw still dropped at this:
> Google's codebase is shared by more than 25,000 Google software developers from dozens of offices in countries around the world. On a typical workday, they commit 16,000 changes to the codebase, and another 24,000 changes are committed by automated systems. Each day the repository serves billions of file read requests, with approximately 800,000 queries per second during peak traffic and an average of approximately 500,000 queries per second each workday. Most of this traffic originates from Google's distributed build-and-test systems.
Mostly I felt terribly unproductive — my changelist generation rate is way below average.
Can you please quote with ">" at the start of the paragraph instead of using a code snippet? This is unreadable on mobile; the text is 3x as wide as the (scrolling) viewport.
The number of engineers and commits is probably in the same ballpark at Amazon. With a DVCS, far fewer operations are done server-side, so I would imagine the read requests are an order of magnitude or two lower. They have tooling to get a global view across all repos; however, there are no cross-repo atomic operations (those can be managed in their build system in a relatively robust way).
What advantage does a company-wide monorepo provide?
The ability to have something like CitC (or this post's Git virtual filesystem) is certainly one big advantage -- no need to clone new packages; they're right there in your "local" source tree. Bazel (Blaze) is another, particularly when coupled with working at HEAD.
My experience with farms of git repos is that the lack of atomic operations over many tiny repos leads to things like version sets and having to periodically merge dependencies. I've worked on teams where that was inevitably neglected during hectic periods, resulting in painful merges of large numbers of changes. That problem simply doesn't exist with working at HEAD and high-quality presubmit test automation/admission control. The single repo also allows for single code reviews spanning multiple packages, which makes it MUCH simpler to rearrange code (Bazel again helps here, since a "package" is any directory with a BUILD file). Package creation is lighter weight for the same reason, and has fewer consequences for poor name choices, since rearrangement is easy and well supported by automated tools.
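To make the "package is any directory with a BUILD file" point concrete, here's a minimal sketch; the directory and target names are hypothetical, but the rules and visibility syntax are standard Bazel:

    # myproject/logging/BUILD  (hypothetical path and names)

    # Targets in this package are private unless they say otherwise.
    package(default_visibility = ["//visibility:private"])

    cc_library(
        name = "logger",
        srcs = ["logger.cc"],
        hdrs = ["logger.h"],
        # Fine-grained visibility: only the server package may depend on this.
        visibility = ["//myproject/server:__pkg__"],
    )

    # Targets within the same package can always depend on each other,
    # so the test needs no visibility grant.
    cc_test(
        name = "logger_test",
        srcs = ["logger_test.cc"],
        deps = [":logger"],
    )

Moving the whole package elsewhere in the tree is then mostly a matter of moving the directory and updating the labels that reference it -- exactly the kind of mechanical change automated tools handle well.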
Sharing one build system where a build command implicitly spans many packages also results in efficient caching of build artifacts and massively distributed builds (think a distributable and cacheable build action per executable command, rather than a brazil-build per package). Each unit test result can be cached, and only dependent tests are re-run as you tweak an in-progress change. This is fantastic for a local workflow (flaky tests can be tackled with --runs_per_test=1000, which with a distributed build system is often only marginally slower than a single test run). Also, you can query all affected tests for a given change with a single "local" bazel query command. The list goes on from here -- I keep thinking of new things to add (finer-grained dependencies, finer-grained visibility controls, etc.).
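As a rough illustration of that workflow (the target labels are hypothetical; the flag and the query functions are real Bazel):

    # Hammer a suspected-flaky test; with remote execution the runs
    # fan out in parallel, so 1000 runs isn't much slower than one.
    bazel test --runs_per_test=1000 //myproject/logging:logger_test

    # List every test in the repo that transitively depends on a target,
    # i.e. everything a change to //myproject/logging:logger could break.
    bazel query 'tests(rdeps(//..., //myproject/logging:logger))'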
It's not that you can't build most of this for distributed repos, but I'd argue it's harder and some things (like ease of code reorg) are nearly impossible.
Subjectively, having worked with both approaches at scale, Google's seems to result in much better code and repo hygiene.
Unfortunately, it's not available externally.