Monorepos: Please don’t (medium.com/mattklein123)
332 points by louis-paul on Jan 3, 2019 | 391 comments



My advice is that if components need to release together, then they ought to be in the same repo. I'd probably go further and say that if you just think components might need to release together then they should go in the same repo, because you can in fact pretty easily manage projects with different release schedules from the same repo if you really need to.

On the other hand if you've got a whole bunch of components in different repos which need to release together it suddenly becomes a real pain.

If you've got components that will never need to release together, then of course you can stick them in different repositories. But if you do this and you want to share common code between the repositories then you will need to manage that code with some sort of robust versioning system, and robust versioning systems are hard. Only do something like that when the value is high enough to justify the overhead. If you're in a startup, chances are very good that the value is not high enough.

As a final observation, you can split big repositories into smaller ones quite easily (in Git anyway) but sticking small repositories together into a bigger one is a lot harder. So start out with a monorepo and only split smaller repositories out when it's clear that it really makes sense.


Components might need to be released “together”, but if they are worked on by different teams, they’ll have different release processes, as in different timelines and different priorities.

First of all, this is normal, because otherwise development doesn’t scale.

In such a case the monorepo starts to suck. And that’s the problem with your philosophy ... it matters less how the components connect, it matters more who is working on them.

Truth of the matter is that the monorepo encourages shortcuts. You’d think that the monorepo saves you from incompatibilities, but it does so at the expense of tight coupling.

In my experience people miss the forest for the trees here. If breaking compatibility between components is the problem, one obvious solution is to no longer break compatibility.

And another issue is one of responsibility. Having different teams working on different components in different repos will lead to an interesting effect ... nobody wants to own more than they have to, so teams will defend their components against unneeded complexity.

And no, you cannot split a monorepo into a polyrepo easily. Been there, done that. The reason is that working in a monorepo versus multiple repos influences the architecture quite a lot and the monorepo leads to very unclear boundaries.


> Components might need to be released “together”, but if they are worked on by different teams, it means they’ll have a different release process, as in different timeline, different priorities.

released "together" == part of the same feature. Timelines, release processes and team priorities are all there to help deliver features. If they stand in the way, they need to be adjusted. Not the other way around.

Multi repos encourage silos. Silos encourage focusing on the goals of the silo and discourage poking around the bigger picture. Couple that with scrum, which conveniently substitutes real progress metrics with meaningless points, and soon enough you end up with an IT department, full of processes but light on delivering value.


And no, you cannot split a monorepo into a polyrepo easily. Been there, done that. The reason is that working in a monorepo versus multiple repos influences the architecture quite a lot and the monorepo leads to very unclear boundaries.

I think you are conflating a monorepo (where boundaries can still be established, e.g. via a module isolation mechanism specific to the stack used) with a "monoproject"/"monomodule", where there is no modularization at all.

Edit: expanded wording


No, there's no such confusion.

> where boundaries can still be established, e.g. via a module isolation mechanism specific to the stack used

Unfortunately this isn't a technical issue and that's the problem.


If the projects within the monorepo are decoupled and have clear boundaries then why not have them in separate repositories?...

In my opinion monorepos make refactoring dependent projects much easier. However it is much harder to establish and enforce clear boundaries...


With monorepos you don't have to manage PRs for 8 different repositories when adding a feature.

In my experience it's hard to establish clear boundaries, regardless of repository kind. It may be more difficult to create features which are tightly coupled across multiple repositories, but people do it regularly. And when they do, you suddenly have to manage and maintain synced features across multiple repositories.

In fact, the repo tool for the Android project makes it quite easy to develop features across repositories, thus lowering the boundaries significantly.


I have a monorepo that contains a few different early-stage frontend web projects that do not interact with each other at all. They do however use a shared component library that is also placed inside the monorepo. Tools like yarn workspaces make sharing the library easy if the projects are located in the same repo.

When I change something in the library, I can easily run tests across all the projects that depend on it with the latest changes of the library and make sure that my change is not breaking things all over the place, which is also pretty nice.

I am not sure yet if using a monorepo is actually the best way to deal with this kind of project, but for now it feels better than having them in separate repos and then having to deal with the complexity of sharing the library across repos by publishing it somewhere or using git submodules or something.


I work on a project structured into microservices and use both. There is one global repo with submodules in subrepositories.

So when someone only wants a submodule they can happily clone just that, but when someone wants everything (which is the default case), they can clone and install it all at once.

The downside is that I have to commit twice.


> Having different teams working on different components in different repos will lead to an interesting effect ... nobody wants to own more than they have to

... and so nobody really understands how all of the components tie together and as a result it takes weeks of manual testing to release.


My rule of thumb is: if you need to do PRs in several repositories to do one feature, you should probably merge the repositories. At work, we have code spread among a bunch of repositories, and having to link to the 2 or 3 related PRs in other repos is a major PITA, and even more so for the reviewers.


My rule of thumb is: if you need to do PRs in several repositories to do one feature, your projects are either tightly coupled enough that they should be one monolithic piece of software, or your tight coupling is a problem you should work on resolving.

Requiring multiple PRs to multiple repos to roll out one user-facing feature is fine, as long as your independent modules/projects are not actually interdependent (i.e. one of those PRs will not break another independent repo that lacks a corresponding PR).


Sometimes a feature needs to change a shared dependency library.

But in that case you could consider the change to the dependency a single release. And ingesting it into another app a separate release.


At a past job, I had to edit roughly 5 different repositories in order to do some trivial programming task (send an email or some such). It was quite easily the least productive / most demoralizing workflow I've ever experienced.

Context switching really sucks. You should aim to reasonably avoid it.


Sending an email can have a few different responsibilities:

Who is the email being sent to?

What is the content of the email?

What data does the email content and recipient depend on?

What are you tracking on the email?

How is the email visually formatted?

All those things might be in different apps as the logic gets more complicated.


Don't mix up the downsides of multirepo with bad composition of your microservices.


Just because things change in tandem, that does not mean that they're all the same thing. When I add a new function to my backend service, all frontends that consume its API also need to be adjusted. But that doesn't mean that the backend service, its command-line clients and its web GUI client should live in the same repo.


It's probably a matter of taste - but I think they should be in the same repo. I like tying test failures/regressions to a specific commit for documentation and admin purposes. Having a test failure or regression due to an 'unrelated' commit in another repo sounds like a nightmare waiting to happen when you try investigating.

I think the difference of opinion is between developers who work on self-hosted "evergreen" products where the latest version is deployed, and others who work with multiple release branches with fixes/features constantly being cherry-picked.


Why? You are just creating more work for yourself by keeping the components in different repos. Now you need to create N commits when updating something. If your future self wants to investigate how the software has evolved there are N times as many commits to analyze.


I really think it should if possible. Makes life much easier in my experience.


Not always. It absolutely makes sense to have a repository for the gui and one for the server. When writing a new feature you usually write some gui code and some server code and create different pull requests. I think monorepos are seriously wrong and I completely agree with this article.


Well... Why does that make sense? I have a repository containing both the GUI and the server, and sometimes I have to make changes to both. Locating those related changes together in the same commit and/or PR makes a lot of sense to me: the changes depend on each other, and thus should be reviewed together. What's the advantage of splitting them up?


Because obviously the changes that you make in the gui are completely isolated from the changes you make on the server. When you are working on the gui the server code is just noise and vice versa. And it gets even worse when you use two different languages for the gui and the server.


> Because obviously the changes that you make in the gui are completely isolated from the changes you make on the server.

In my experience, that is almost never the case. Often, the frontend requires a new endpoint or a modification to an existing endpoint. If you don't coordinate this change, you end up with a non-functional PR that cannot even be tested. Same happens when the backend proposes an endpoint change that affects the frontend.

We have moved the frontend and backend to the same repo to make coordination and testing of such cases simpler.


You make the endpoint first, and test it without the UI. What challenges do you foresee here?


* Changing graphql schemas.

* Any non-backwards compatible change in the interface between the components. Yes this can be solved. But when working in a smaller team on proprietary software why use time solving a problem you don't need to solve?

(This is from experience.)


> why use time solving a problem you don't need to solve?

Unless they're running on the same computer and deploy literally simultaneously, this is already a problem you need to solve.


A surprising number of companies are prepared to accept an hour of downtime for an internal system if it saves them money. In my experience the best business practice is to offer the product owner/manager the costed options in such a situation and allow them to choose.


That only really matters if your backend developers are a different team to your frontend developers where they'd want to be working concurrently. And even then, they could work in different branches and both teams merge into a development branch when finished.

The idealistic discussions for or against monorepos often overlook the most important detail: who's working on the code and how would you want them to version control it?

If it's separate projects with their own versioning then it makes sense to have them as separate repositories. If it's a single project but with individual components you'd want to version (eg because it's developed by different teams with different release timelines) then there you also have a situation where you'd want to version the code separately so once again there is a strong argument for separate repositories. However if it's one product with a single release schedule then splitting up the frontend from the backend can often be a completely unnecessary step if you're doing it purely for arbitrary reasons such as the languages being different. (I mean Git certainly doesn't care about that. A project might have Bash scripts, systemd service files, Python bootstrapping, code for an AOT compiled language (eg Rust, Go, C++, etc), YAML for Concourse, etc. They're all just text files required for testing and compiling so you wouldn't split all of those into dozens of separate repos).


> That only really matters if your backend developers are a different team to your frontend developers

What if there is one team, but different developers (one working on the frontend, another on the back)? What if QA can test the API while the frontend development is ongoing?

What if the front and backends have different toolchains, and ultimately separate execution environments (server app backend vs JS running on client machines).


I’m not sure what your point is. There’s obviously going to be thousands of different scenarios that I didn’t cover; it would be impossible to cover every imaginable use case.

> What if there is one team, but different developers (one working on the frontend, another on the back)?

Then presumably everyone in that team is full stack? (Otherwise it would be different teams in the same department.) So it still makes sense to have a monorepo because you could have a situation (holiday, sickness) where someone would be working on both the front end and back end. Thankfully Git is a distributed version control solution and supports feature branches so you can still have multiple people working on the same repo and then merge back into a development branch.

> What if QA can test the API while the frontend development is ongoing?

Testing isn't the same as released versions. You can (and should) test code at all stages of development regardless of team structures, git repo structures or release cycles.

> What if the front and backends have different toolchains, and ultimately separate execution environments (server app backend vs JS running on client machines).

I’d already covered that point when talking about different languages in the same repo. You’re making a distinction about something that version control doesn’t care in the slightest about.

I think it's fair to say any significant cross-project tooling should be its own repo (you wouldn't include the web browser or JVM with your frontend and backend repos). But if it's just bootstrapping code that is used specifically by that project then of course you'd want that included. Eg you wouldn't have Makefiles separate from C++ code. But you wouldn't include GCC with it because that's a separate project in itself.

Ultimately though, there is no right answer. It’s just what works best for the release schedule of a product and teams who have to work upon that project.


This is true for systems where there is a well-defined protocol between GUI and servers and a proper versioning process in place, i.e. most "old-school" client/server systems.

I expect lots of people on HN are working on systems with very tight coupling between client/GUI and server and no proper versioning between them, as is common in web applications. Hence the replies to the contrary: you're probably from quite different worlds :)

(Now, I personally think that maintaining sound versioning practices is a good idea even if you do have tightly coupled control of both the client and the server side. But that may just be me...)


I think what matters in the end is Conway's law. Conway's law is frequently misinterpreted as an observation when it's actually advice: Structure your applications/repos like you structure your teams. You're going to end up with that code structure anyway, so might as well save some time.


Hmm, that's not really obvious to me. Sometimes the server has to deliver new data that is to be used in the GUI, so it's nice to be able to present those together in the same PR. If it then happens that the server-side changes do not match what you need in the GUI, it's relatively painless to add those changes in the same branch that hasn't yet been merged. In other words: although you can make changes in one without breaking the other, that doesn't make them completely isolated.


You should always have a communication layer between the gui and the server. For example using protobuf you would update the proto definition (that can be in a shared repo) and when building the gui and the server the protobuf layer is regenerated. So the only place where you make your changes for the new data contract is the shared repo and the gui and server would automatically have the new changes.


So now we're at three repos, one of which is shared by the other two, and changes will have to be coordinated over them. I fail to see how that is an improvement over having both in the same repo.

In the end, I think the other comments are right that it mostly depends on who's working on something. If it's different teams, then different repos probably make sense. But if I'm responsible for both the back-end and the front-end, they're usually not isolated at all at least in terms of project requirements, and hence keeping them together makes sense.

(But of course, even then there are nuances. I think the article is mostly arguing against monorepos as in company-wide monorepos. I'm willing to believe Googlers that it works well for Google, and I'm not in a position to claim what it'd be like for other companies. Team-wide monorepos for different parts of the same project, however, make a lot of sense to me.)


> I'm willing to believe Googlers that it works well for Google

It doesn't. In my entire career, that was the only environment in which some random would break us and we couldn't do anything about it other than hope for a rollback and then wait for hours for the retest queue to clear before we could deploy anything at all.

Maybe not all the time, but you need the escape hatch of pinning healthy deps, because HEAD of everything is not guaranteed to work.


Well, I'm willing to believe you that it didn't work well for you as well. My point is that company-wide monorepos are largely irrelevant here, as I'm not arguing in favour of or against those (I'm leaving that to people who've worked with them).


It'll be really typical for a gui/server pair to want to share some is_valid_payload() function: the client to validate a payload before sending, and the server to do its own validation.

If it's a monorepo your PR might be a 2 line patch to that function, then adding the GUI and server code.

If you split it you'll first need to have a PR on the "validation-lib" repo, then once that gets in, a PR on the "server" repo, bumping the "validation-lib" version dependency, and finally a PR on the "gui" repo bumping the dependency for both "validation-lib" and "server" (for testing etc.). That's before you need to deal with the circular dependency that "server" also wants "gui" for its own "I changed my server code, does the GUI work?" testing.

Better just to have them in a monorepo if they're logically the same code and want to share various components.
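
To make that concrete, here is a minimal sketch in Python (purely illustrative: the file layout, field names and length limit are invented, only the is_valid_payload() name comes from the example above) of what sharing that function looks like inside one repo, so the "2 line patch" plus the GUI and server call sites all land in a single PR:

  # One repo, three logical pieces; shown as one runnable file for brevity,
  # with comments marking where each part would live in the tree.

  # shared/validation.py -- the rule both sides agree on
  def is_valid_payload(payload: dict) -> bool:
      return (
          isinstance(payload.get("user_id"), int)
          and isinstance(payload.get("message"), str)
          and len(payload["message"]) <= 280
      )

  # client/cli.py -- the client validates before sending, saving a round trip
  def send(payload: dict) -> None:
      if not is_valid_payload(payload):
          raise ValueError("refusing to send malformed payload")
      # ... actually POST the payload here ...

  # server/app.py -- the server never trusts the client and validates again
  def handle_request(payload: dict):
      if not is_valid_payload(payload):
          return 400, "bad payload"
      return 200, "ok"

  if __name__ == "__main__":
      print(handle_request({"user_id": 1, "message": "hello"}))  # (200, 'ok')
      print(handle_request({"user_id": 1}))                      # (400, 'bad payload')

In the split version the first block would live in "validation-lib", and the same two-line change would mean a publish and two version bumps before either caller could even see it.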


> If it's a monorepo your PR might be a 2 line patch to that function, then adding the GUI and server code.

> If you split it you'll first need to have a PR on the "validation-lib" repo, then once that gets in, a PR on the "server" repo, bumping the "validation-lib" version dependency, and finally a PR on the "gui" repo bumping the dependency for both "validation-lib" and "server" (for testing etc.). That's before you need to deal with the circular dependency that "server" also wants "gui" for its own "I changed my server code, does the GUI work?" testing.

The above is exactly why I am so firmly opposed to multirepo[0]-first. And it's really just a throwaway example: a real change would involve multiple different library and executable repos, all having separate PRs. And then there's the relatively high risk of getting a circular incompatibility.

This can be worth the cost, for organisational reasons. But until you need it, don't do it. It's very easy to split a git repo into multiple repos, each retaining its history (using git filter-branch). Don't incur the pain until you need to, because honestly, you're not likely to need to. You're probably not going to grow to the size of Google. Heck, most of Google runs in one monorepo, with a few other repos on the side: if they can make it work at their scale, so can you. And if, as the odds are, you never grow to their size, then you'll never have wasted time engineering a successful multirepo system instead of delivering features to your business & customers.

0: 'polyrepo,' really? https://trends.google.com/trends/explore?date=all&q=multirep... clearly shows that 'multirepo' is the term.


These are two separate functions. Why would you ever want a function that checks both gui and server? The gui validation logic belongs to the gui layer, the server validation logic to the server layer. If you have a function that contains logic from both layers there is something seriously wrong with your design.


The classic reason for any validation is that you want the validation to be done in the frontend (to save a network roundtrip and provide better, immediate feedback), on the backend (so that if the frontend is compromised and maliciously circumvents that validation, it still gets validated), and both of the validations to be the same to prevent inconsistencies.

A good way to fulfil those requirements is to have the exact same function available in both places.


If you have the same functionality that can't be re-used (for no reason), then I'd call that a design flaw.

I'll need a few more validation functions for each client. I don't want to write+maintain multiple functions that do the same thing, even if it's just copy+paste.

It's "data" validation. So let's put that in the "data layer" repo.

We now have, at least:

- Server

- Web (GUI)

- Android

- iOS

- Data

- More clients?

We'll also have branches for each development task. How do we know what branch the other branches should use? One "simple" feature can easily spread over multiple repos. Does each repo refer to the repo+branch it depends upon (don't forget to update the references when we merge!), or do we add a "build" repo which acts as the orchestrator?

Most PRs will need to be daisy chained - who reviews each one? Will they get committed at the same time?

How do we make the builds reproducible? commit hashes? tags? ok, we now need to tag each repo, and update the references to point to that tag/hash... but that changes the build.

Well, I'm glad our code base is split over multiple repos because "scalability".


Imagine something like "curl" where a client needs to validate a manually provided request before making it.

In any case, if you're nitpicking that example you're missing the point. The same would go for any other shared code you could imagine between a client and server that logically make up one program talking over a network.


I still can’t see how you would have a shared library for a C# gui and a Java server for example. Your communication layer would obviously live in both repositories. Even in case you are using the same language and you do have shared libraries then what is the problem? The shared libraries would surely be shared with other projects so it makes sense to have them in a separate repository.


In cases where there's a high degree of churn (i.e. early-stage startups) in shared libraries, updating those libraries can cause a large amount of busywork and ceremony.

If you had a `foo()` function shared between the GUI and the server (or two services on your backend, or whatever), in a monorepo your workflow is:

   - Update foo()
   - Merge to master
   - Deploy
In a polyrepo where foo() is defined in a versioned, shared library your workflow is now:

   - Update foo()
   - Merge to shared library master
   - Publish shared library
   - Rev the version of shared library on the client
   - Merge to master
   - Deploy client
   - Rev the version of shared library on the server
   - Merge to master
   - Deploy server
This problem gets even more compounded when your dependencies start to get more than one level deep.

I recently dealt with an incredibly minor bug (1 quick code change), that still required 14 separate PRs to get fully out to production in order to cover all of our dependencies. That's a lot of busywork to contend with.


It seems to me that the real problem is your toolchain. In a previous project the workflow was like this:

   - Update foo()
   - Merge to master
   - Publish shared library
   - Deploy

So as you can see the only step added was to publish the shared library that would automatically update the version in all the projects using it. If you are really doing everything manually I can understand that this is a pain, but this has nothing to do with the monorepo / multiple repo distinction, this is a tooling problem.


But you've just invented a sharded monorepo, and now have all the monorepo problems without the solutions.

What if updating foo() breaks something in one of the clients (say due to reliance on something not specified)? Then you didn't catch that issue by running the client's tests, now the client is broken, and they don't necessarily know why. They know the most recent version of the shared library broke them, but then they have to say "you broke me", or one of the teams needs to investigate and possibly bisect across all changes in the version bump, running their tests, to find the breakage.

How is that handled?

(the broader point here is that monorepo or multirepo is an interface, not an implementation; it's all a tooling problem. There are features you want your repo to support. Do you invest that tooling in scaling a single repo or in coordinating multiple ones? Maybe I should write that blog post).


Some package managers that support git repos as dependency versions can offset this in development.


>It makes absolutely sense to have a repository for the gui and one for the server.

Not really. You can have a single repo with top level directories tigershark-gui and tigershark-server.


What is the point instead of having them in two separate repos?


Any full stack change will be represented by one PR that changes from pre change to post change. Two repos would introduce a new possible state where one has the change applied and the other doesn't.


And later you add the iOS and Android clients too. Will those go into the same repo? Better to keep server and clients apart, especially if release schedules are different.


Sure if the release schedules are different then have them in separate repos so things like tagging make sense. But often people work with a single release schedule. There's just so many variables that go into these decisions that the thread here is bonkers.

Smart people can work through problems to get the job done. Monorepo vs polyrepo won't stop people from moving forward.


> As a final observation, you can split big repositories into smaller ones quite easily (in Git anyway) but sticking small repositories together into a bigger one is a lot harder. So start out with a monorepo and only split smaller repositories out when it's clear that it really makes sense.

If you only need to do this once, subtree will do the job, even retaining all your history if you want.

I'm not sure what the easier way to split big repos is.


To split, you can duplicate the repo and pull trees out of each dupe in normal commits.


In principle: Yes.

In practice, I can tell you from first-hand experience that this isn't all that simple in bigger, organically grown cases (you'll have many other things to consider if you want to keep the history in a useful way). Especially the broken branching model of SVN and co. is a problem here: in the wild, it immediately leads to "copy&paste branching" (usually through multiple commits). Migrating that to Git or Hg and splitting it up can be a challenge.


I haven't tried in Git, but with Mercurial merging repos is as simple as pulling from an unrelated repository and merging, that's it. It's a lot simpler than splitting a repo up unless you accept that all of the old history can remain, then you just make a clone and delete what should no longer be a part of the repository.

But a monorepo leads to tight coupling, and that is just as much a pain to work with as versioning; or two teams end up simultaneously working on the same shared code, and you have not only merge conflicts but conflicting functionality.


So why is that? Why do we need to couple the software development effort together with the release? Based on my experience there is no difference between the monorepo vs multirepo approach from the deployment point of view.


After trying to get the best of both with Subversion Externals and Git Submodules, I'd have to agree. At least until things are so loosely coupled they're begging for a public release.

That said, some packaging solutions can bridge the gap reasonably well. Unless you need instantaneous, atomic releases.


I switched to using submodules about a year ago, and they work very well for a project + a set of 4 dependencies. I handle that zoo from VS Code + Git Lens plugin.

Funnily, I only use Code to handle commits to submodules, because Git Lens is not available for the full VS IDE.


What are you talking about! In my perfect micro services world I just have these enforced bounded contexts that are so perfectly designed they never need to change. Consequently all parts of the system are perfectly independent snowflakes that can be deployed without thinking about any other parts of the system. It’s beautiful really when you think about the mess that things were before we could do this!


I generate Coq proofs of Swagger descriptions that were compiled from a speech to text dump during a 10 person Hangout. Downside is that some of the protobufs aren't laid out as cleanly as one would like.


While I know you are being sarcastic, I really have heard bushy-tailed young "architects" say something similar, having just read about Domain Driven Design and decided they were going to "educate us".


Oh I worked on a project like this, which still hasn’t launched any software yet 5 months after I left...


I can think of situations where components 'need' to release together because of organizational rules and not any actual binding between the components, in that case of course they do not need to be in the same repository.

I agree that you should always start with one repo and split as needed, it's the MVR way (minimum viable repository)



My problem with polyrepos is that often organizations end up splitting things too finely, and now I'm unable to make a single commit to introduce a feature because my changes have to live across several repositories. Which makes code review more annoying because you have to tab back and forth to see all the context. It's doubly frustrating when I'm (or my team is) the only people working on those repositories, because now it doesn't feel like it gained any advantages. I know the author addresses this, but I can't imagine projects are typically at the scale they're describing. Certainly it's not my experience.

Also I definitely miss the ability to make changes to fundamental (internal) libraries used by every project. It's too much hassle to track down all the uses of a particular function, so I end up putting that change elsewhere, which means someone else will do it a little different in their corner of the world, which utterly confuses the first person who's unlucky enough to work in both code bases (at the same time, or after moving teams).


My current team managed to break a single "component" out into a separate repository. Then that repository broke into two, then those broke into other repositories, until we eventually ended up with around 10 or so different repositories that we work on every day.

An average change touches 4 of them, and touching one of them triggers releases of 2 or 3 of them on average. Even building these locally is super tedious, because we don't have any automation in place (nor do we formally plan to) for chain-building them locally.

This is a nightmare scenario for myself. A simple change can require 4 pull requests and reviews, half a day to test and a couple hours to release.

Yet my team keeps identifying small pieces that can be conceptually separated from the rest of the functionality, even if they are heavily coupled, and makes new repos for these!


I’ve come to the conclusion that an organisation should ideally have no more than one primary repo, with maybe a handful of ancillary repos for stuff that really doesn’t make sense in the primary. What does ‘organisation’ mean there? Well, it could mean a company, or a team, or a division. Just as software conforms to organisational structure (Conway’s Law), so too should repo structure.

Once you start having lots of peer repos being worked on within the same organisation on a daily basis, you know that you’ve partitioned far too far, and you need to roll back.

Otherwise one ends up in exactly the position you’re in. The ultimate slippery-slope end-state would be hilariously bad: a repo for each ASCII character, with repos for each word or symbol constructed out of those characters, with repos for each function constructed out of those words & symbols, with repos for each module constructed out of those functions, with repos for each system constructed out of those modules, with any change requiring a massive, intricate, failure-prone dance in order to update anything, all while patting oneself on the back about how one has avoided complexity.

Noöne sane would argue for that situation, and yet I’ve seen smart people argue that requiring coördinated changes to half a dozen repos is fine & dandy.


even if they are heavily coupled,

So don't use polyrepos for heavily coupled projects, then. Or even better...

... try to avoid heavy coupling in the first place.


Unfortunately, these debates tend to be of the bikeshed variety.

Q: Why are we debating the merits of mono-repos over poly-repos?

A: Because managing dependencies is really hard and needs expertise.


It's an interesting social problem in how you manage those project / library / repository boundaries. On the flipside, though, it's been well documented that among many of the major monorepos those boundaries still exist, they just become far more opaque because no one has to track them. You find the weird gatekeepers in the dark that spring out only when you get late in your code review process because you touched "their" file and they got an automated notice from a hidden rules engine in your CI process you didn't even realize existed.

In the polyrepo case those boundaries have to be made explicit (otherwise no one gets anything done) and those owners should be easily visible. You may not like the friction they sometimes bring to the table, but at least it won't be a surprise.


http://wiki.c2.com/?ConwaysLaw "Conway's Law" is something like "organisation of code will match the organisation of people". It's a neat description.

I think it's more common to merge or split modules and classes than repositories. I wonder if there'd be less tension if repos and teams were 1:1 though.


> I wonder if there'd be less tension if repos and teams were 1:1 though

Anecdotally, yes I think it helps a lot. I was once part of an organization for which each "team" having a repo is the only thing that prevented violence :-)


I've also seen arbitrary separations of repos because 2 people didn't get along and couldn't work together.


Can a monorepo support module- or subdirectory-level ownership controls? Or do teams using a monorepo just do without them?

Partially answering my own question: SVN, recommended in a prior comment [0], supports path-based authorization [1]. But what about teams using another version control system?

[0] https://news.ycombinator.com/item?id=18810313

[1] http://svnbook.red-bean.com/en/1.5/svn-book.html#svn.serverc...


At Google, this is pretty explicit with a plaintext OWNERS file in the directory. Internal IDEs not only have an understanding of that, but can automatically suggest a minimal set of reviewers, ideally in a close time zone and not out of office.


Piper, Google's monorepo implementation, has that, and it is very important and widely used.


With Phabricator, yes, you can set up Herald rules that stop a merge from happening if a file has changed in a specific subdir.

We use service owners, so when a change spans multiple services, they are all added automatically as blocking reviewers.


For my clients, I use an open source monorepo submodule inside the client's proprietary monorepo. I can maintain the organizations' software while sharing common code.

So (mono)repos are composable.


If I remember correctly, Gitlab has introduced some sort of ownership control where you can say who owns what directories for things like approving merge requests that affect those directories.


Hey, did you mean assigning approvers based on code owners [1]? You can find more info about Code Owners and syntax in the documentation [2].

[1] https://gitlab.com/gitlab-org/gitlab-ee/issues/1012 [2] https://docs.gitlab.com/ee/user/project/code_owners.html


Github has this feature[1], which we use extensively in our monorepo.

[1] https://help.github.com/articles/about-codeowners/


google3 uses the same model that Chromium does, see an example here: https://github.com/chromium/chromium/blob/master/chromeos/OW...


SVN allows you to create multiple repos within a repo. (That's probably why the path-based auth works).

Git has the idea of sub-modules, but they're really just filters. (They're in the same repo). So ultimately, you don't have that kind of control.


Git submodules are not in the same repo, they are a link from one repo to another, and you need to push to both if you make a change to the submodule. Maybe you're thinking of subtrees? I've never used those.


OctoLinker really helps when browsing a polyrepo on Github:

https://github.com/OctoLinker/OctoLinker

You can just click the import [project] name and it will switch to the repo.


It's very much possible to make changes to internal libraries used all over the place, but it does require versioning to be something that people think about, and a mechanism for depending on those libraries other than pulling them straight from source control. Once you've got some sort of dependency management, such as an internal gem/npm/whatever source, you can treat those internal dependencies the same as you'd treat external ones, instead of having to somehow coordinate a release of absolutely everything in one go.


That's not really that different in a monorepo since you often need reviews from the same number of people anyway.

I once had to wait for 9 months to get a complex change through in a monorepo setting because of all the people involved, the amount of stuff it touched and the fact that everything was constantly in flux, so I spent half my time tracking changes. I'm not saying it would have been faster in a polyrepo. I'm saying that complex changes are complex regardless of how the source is organized.

I do however think that polyrepos forces you to be more disciplined and that it is easier to slip up in a polyrepo and turn a blind eye to tighter couplings.


The multi-repository code review is an interesting concept. Here at RhodeCode we're actually working on implementing such a solution. This is first of all to solve our internal problem of release code reviews usually spanning two projects at once.

This is a hard and complex problem. Especially how to make code-review not too messy if you target 5-8 repos at once.


I think this article is complete horseshit. A monorepo will serve you 99% of the time until you hit a certain level of scale when you get to worry about whether a monorepo or a polyrepo is actually material. Most cases are never going to get there. Before that point, a polyrepo is purely a distraction and makes synchronous deployment really painful. We had to migrate a polyrepo to a monorepo and it was not fun because it was a migration that should have never had to be done in the first place. Articles like this are fundamentally irresponsible.


I work on CI/CD systems, and that’s one thing that definitely gets harder in a monorepo.

So you made a commit. What artifacts change as a result? What do you need to rebuild, retest, and redeploy? It doesn’t take a large amount of scale to make rebuilding and retesting everything impossible. In a poly repo world, the repository is generally the unit of building and deployment. In monorepo it gets more messy.

For instance, one perceived benefit of a monorepo is it removes the need for explicit versioning between libraries and the code that uses them, since they’re all versioned together.

But now, if someone changes the library, you need to have a way to find all of its usages, and retest those to make sure the change didn't break their use. So there's a dependency tree of components somewhere that needs to be established, but now it's not explicit, and no one is given the option to pin to a particular version if they can't/won't update. This is the world of Google & it influenced the (lack of) dependency management in Go.

You could very well publish everything independently, using semver, and put build descriptors inside each project subdirectory, but then, congratulations, you just invented the polyrepo, or an approximation thereof.


> So you made a commit. What artifacts change as a result? What do you need to rebuild, retest, and redeploy?

If you're using Git, then typically for each push to the remote repository you get a notification with this data in it:

  BRANCH        # the remote branch getting updated
  OLD_COMMIT    # the commit the branch ref was pointing to before the push
  NEW_COMMIT    # the commit the branch ref was pointing to after the push

  # To get the list of files that changed in the push:
  git diff --name-only "$OLD_COMMIT" "$NEW_COMMIT"
Once you know which files changed in a push you can figure out which artifacts you need to build. Right now you'll have to write that tooling yourself since I don't know of any off-the-shelf tools that do it. In my company's case, we have "project.yml" files scattered through the repo telling us which directories have buildable artifacts and what branches each one needs to be built for. The tooling to support this is a few hundred lines of Bash and Python. In our case we're still small enough that we can brute force some stuff, but we can easily improve the tooling as we go along.
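
As a rough sketch of that approach (Python; simplified and not our actual tooling), you can map each changed path to the nearest ancestor directory that contains a project.yml and rebuild only those projects:

  #!/usr/bin/env python3
  """Toy version of the marker-file lookup: given two commits, print the
  project directories whose files changed between them."""
  from __future__ import annotations

  import subprocess
  import sys
  from pathlib import Path

  def changed_files(old_commit: str, new_commit: str) -> list[str]:
      out = subprocess.run(
          ["git", "diff", "--name-only", old_commit, new_commit],
          check=True, capture_output=True, text=True,
      )
      return [line for line in out.stdout.splitlines() if line]

  def owning_project(changed: str, repo_root: Path) -> Path | None:
      """Walk up from the changed file to the nearest dir holding a project.yml."""
      current = (repo_root / changed).parent
      while True:
          if (current / "project.yml").exists():
              return current.relative_to(repo_root)
          if current == repo_root:
              return None  # not owned by any buildable project
          current = current.parent

  if __name__ == "__main__":
      old, new = sys.argv[1], sys.argv[2]   # e.g. OLD_COMMIT and NEW_COMMIT above
      root = Path(".").resolve()
      projects = {owning_project(f, root) for f in changed_files(old, new)}
      for project in sorted(p for p in projects if p is not None):
          print(project)

A real version would also read each project.yml to see which branches the artifact should be built for, as described above.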


This is something I've been working on a bit myself.

Figuring out which files changed is relatively easy (as you've demonstrated). Figuring out the impact of that is quite hard in non-compiled languages (tools like Maven, Buck, Bazel, etc do this well for compiled languages). I.e. in a repo which is primarily JavaScript, I can get the list of changed files, and hopefully have unit test files which map obviously onto those. However, knowing whether these are depended on by other files/modules (at some depth) is much harder. Same for integration tests -- which of these are related?


I believe the typical approach is to have each project.yml list its dependency projects. Build a DAG (error on cycles) and then build all changed and downstream projects.
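
Roughly, and only as an illustration (Python, with made-up project names, borrowing "validation-lib" from the example upthread), that means: collect each project's declared dependencies, reject cycles, then rebuild the changed projects plus everything downstream of them:

  from __future__ import annotations

  from collections import deque

  # Hypothetical result of reading every project.yml: project -> its dependencies.
  DEPS = {
      "validation-lib": [],
      "server": ["validation-lib"],
      "gui": ["validation-lib", "server"],
      "ios-app": ["validation-lib"],
  }

  def check_acyclic(deps: dict[str, list[str]]) -> None:
      """DFS colouring; raise if the declared dependencies contain a cycle."""
      WHITE, GREY, BLACK = 0, 1, 2
      colour = {p: WHITE for p in deps}

      def visit(node: str) -> None:
          colour[node] = GREY
          for dep in deps.get(node, []):
              if colour.get(dep, WHITE) == GREY:
                  raise ValueError(f"dependency cycle through {node!r} -> {dep!r}")
              if colour.get(dep, WHITE) == WHITE:
                  visit(dep)
          colour[node] = BLACK

      for project in deps:
          if colour[project] == WHITE:
              visit(project)

  def rebuild_set(deps: dict[str, list[str]], changed: set[str]) -> set[str]:
      """The changed projects plus every project downstream of them."""
      dependents: dict[str, set[str]] = {p: set() for p in deps}
      for project, its_deps in deps.items():
          for dep in its_deps:
              dependents.setdefault(dep, set()).add(project)

      to_build, queue = set(changed), deque(changed)
      while queue:
          for dependent in dependents.get(queue.popleft(), ()):
              if dependent not in to_build:
                  to_build.add(dependent)
                  queue.append(dependent)
      return to_build

  if __name__ == "__main__":
      check_acyclic(DEPS)
      print(sorted(rebuild_set(DEPS, {"validation-lib"})))
      # ['gui', 'ios-app', 'server', 'validation-lib']

Keeping those declared edges honest is the hard part in practice, which is exactly what build tools like Bazel, Buck or Pants automate.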


Rebuild and deploy everything, what's the actual problem? Like the OP said, that's a scale issue and most projects don't have it.

Also building/testing is far more effective at finding dependencies than just going by repo structure. There are numerous package managers available to solve versioning if you need separate components.


100% agree with your entire comment. This is what we do with our monorepo now -- it turns out the rebuilding and deploying everything is actually just fine. If your application services are stateless and decoupled from your state stores, it's completely harmless. If you need to do something fancy, congrats! You're at scale -- enjoy it but remember that it's something rare.


Yes! This brings to mind Donald Knuth: "Premature optimization is the root of all evil."


One thing I heavily enjoy about monorepos (I'm talking java/c#/c++ projects) is the ability to navigate the entire codebase from within an IDE. That alone has caused me to migrate projects (medium projects, ~20 developers) from poly to mono repos, dropping tons of duplication in the build system in the process. I can think of good reasons to split projects along boundaries when it makes sense, but not blindly by default, and not without carefully considering the tradeoffs.


bazel/buck/pants all solve this, but independently of that, they're probably the best build systems.


In the java world this gets solved with gradle's incremental build system, which uses a build cache, a user configured dependency tree, and some hashing to determine what needs to build.


I found it to be neither horseshit nor irresponsible. A bit overdrawn and skewed in some of its arguments, perhaps. But then again... so was your critique. For example:

We had to migrate a polyrepo to a monorepo and it was not fun because it was a migration that should have never had to be done in the first place

s/polyrepo/monorepo/ in the above and you have an assertion of about equal plausibility and weight.


No, it is horseshit. 99% of companies will never hit big company VCS scaling issues, and once they do, they're on their own. To characterize that scale as common is one of the most embarrassing failures of modern software engineering. People are so embarrassed to use well worn tooling and accept that large scale is both uncommon and something that doesn't invalidate tried and true patterns for smaller scales. It's utterly baffling to me.


> It's utterly baffling to me.

It's not hard to explain: Scale has been fetishized by the industry/trade. Everyone wants the cachet of working at scale. 1.5 GB of CSV text? That's Big Data, let's break out map-reduce. 1 load balancer and not enough servers to fill half a rack? That's a scalable architecture, we could scale to multiple datacenters at some point in the future, so let's design it now.

Deploying oversized solutions is partly due to outsiders jonesing the scale of Google, Fb and gang, partly Resume-stuffing ("I have worked with this tech before"), and lastly FANG diasporans who miss the tech they used and rewrite systems/evangelize the effectiveness of those solutions to much smaller organizations.


To be fair, part of the problem is that each of us has been bitten throughout his career by issues which could have been prevented by being able to predict the future. We then move from the truth that if we had known the future, we could have acted better yesterday to the fallacy that today we finally know what we're going to need tomorrow.

This isn't isolated to our industry, of course: a constant refrain is that generals & admirals fight the last war; the financial industry is rife with products which are secure against the last recession, and so forth.


"To characterize that scale as common is one of the most embarrassing failures of modern software engineering."

This point cannot be stressed enough. Almost all the worst software engineering failures I have seen have been caused by premature scaling - which is way worse than premature optimization because the latter's effects are usually local. But premature scaling causes architectural decisions that affect the whole project and simply cannot be undone.

One example among many: some of the influential engineers insisted that we needed four application servers with fail-over because they had experienced servers crashing under heavy load. This complicated failover setup took a huge amount of time and resources to set up, delaying the project by months. In the end it only attracted a few hundred visitors per day and was cancelled in under a year.


This complicated failover setup took huge amount of time and resources to setup, delaying the project by months.

Hmm - failover shouldn't be that hard to set up. If it was then that suggests that other issues (technical debt, inexperienced management) were the more likely culprits.

Not the simple fact that they chose to address the need for failover.


> [it] shouldn't be that hard ...

Now where have I heard those words before... :)


> 99% of companies will never hit big company VCS scaling issues

A much higher percentage of developers will. Number of companies is not a good metric for whether a topic is worthy of discussion.


Number of companies is a good metric, because companies own the repos and if it becomes a pain-point, only the developers working at that point in time will be hit by this. Anyone who leaves before this inflection point or joins after it's been solved will not be hit, so I don't think the percentage of developers in that intersection is large.


> after it's been solved

I think a quick perusal of this page will show that it's not really "solved" after all. A far higher percentage of developers continue to be affected by large-repo issues than a Python-specific issue (currently #1 story on the front page) or anything to do with Ethereum (currently #7). Are those "horseshit" topics too?


I agree, it's not really solved, but solved "enough". You can't have your cake and eat it; there are tradeoffs involved - if you grow large enough to hit monorepo limitations, you are large enough to invest in tooling that manages your workflow (the tradeoff). However, if you're a small organization, you can't afford the tooling and you're wasting time/quality coordinating polyrepo releases, so you are better off with a monorepo.

> A far higher percentage of developers continue to be affected by large-repo issues than...

Are you suggesting that the results of the HN ranking algorithm at this very moment in time are a good metric for measuring what affects developers? I don't agree, and besides, @yowlingcat's opinion that the article is "horseshit" is unrelated to how well it's ranked on HN.


> opinion that the article is "horseshit" is unrelated to how well its ranked on HN.

When the opinion is not just disagreement but outright dismissal of the topic as worth discussing, I'd say ranking is relevant. So is comment count. Clearly a lot of people do believe it's worth discussion, not irrelevant or a foregone conclusion as yowlingcat tried to imply.


A lot of people can think a lot of things are worth discussion, but it doesn't mean it's prudent to waste time on it.


Incidentally, I think those are horseshit topics as well (Coconut is someone trying to daydream Python into Haskell with no practical reason to do so, and making Ethereum scale better doesn't make a legitimate use case for it emerge) but that's beside the point.

What you call large-repo issues I call organization issues. From your other comments, it's clear that we draw the lines at different places, but I think I'm right and you're wrong in this case because I've seen engineers try to solve organizational issues with technology enough times that it's a presumable anti-pattern. Why don't we take your own words at face value?

"That hasn't been my experience. Yes, it's a culture thing rather than a technology thing, but with a monorepo the "core" or "foundation" or "developer experience" teams tend to act like they're the owners of all the code and everyone else is just visiting. With multiple repos that's reversed. Each repo has its owner, and the broad-mandate teams are at least aware of their visitor status. That cultural difference has practical consequences, which IMO favor separate repos. The busybodies and style pedants can go jump in a lava lake."

Why are there busybodies and style pedants working in your organization? Because your organization has an issue. Do you think that would be at the root of this pain, or a tool choice? I'll give you a hint, it's not the tool choice.


> Why are there busybodies and style pedants working in your organization?

Because to an extent they serve a useful purpose. In a truly large development organization - thousands of developers working on many millions of lines of code - fragmentation across languages, libraries, tools, and versions of everything does start to become a real problem with real costs. You do need someone to weed the garden, to work toward counteracting that natural proliferation. That improves reuse, economies of scale, smoothness of interactions between teams, ease of people moving between teams, etc. It's a good thing. Unfortunately...

(1) That role tends to attract the very worst kind of "I always know better than you" pedants and scolds. Hi, JM and YF!

(2) Once that team reaches critical mass, they forget that the dog (everyone else) is supposed to wag the tail (them) instead of the other way around.

At this point, Team Busybody starts to take over and treat all code as their own. Their role naturally gives them an outsize say in things like repository structures, and they use that to make decisions that benefit them even if they're at others' and the company's expense. Like monorepos. It's convenient for them, and so it happens, but that doesn't mean it's really a good idea.

Sure, it's a culture issue. So are the factors that lead to the failure of communism. But they're culture issues that are tied to human nature and that inevitably appear at scale. I know it's hard for people who have never worked at that scale to appreciate that inevitability, but that doesn't make it less real or less worth counteracting. One of the ways we do that is by putting structural barriers in the corporate politicians' way, to maintain developers' autonomy against constant encroachment. The only horseshit here is the belief that someone who rode a horse once knows how to command a cavalry regiment.


You realize many, if not most, people reading this work at places already big enough to have "VCS scaling issues". I've seen more than a few monorepos, but I've never seen one used as anything but a collection of small repos.


No, it is horseshit [because scale]

The thing is, scale was only one factor listed among many.


Was it? Once the scale problems are gone -- you assume that all code can be checked out on one machine, and you have enough build farm capacity to build all the code -- most of the article's points no longer apply.

The points which still apply are Upside 3.3 (you don't deploy everything at once) and Downside 1 (code ownership and open source are harder).

And those are pretty weak arguments -- I would argue that deployment problems exist with polyrepos as well, and there are now various OWNERS mechanisms.

The fact that monorepos are harder to open source is a good point, but having to maintain multiple separate repos just in case we would want to open source one day seems like severe premature optimization.


In my experience, monorepos cause outrageous problems that have nothing to do with scale. Small or medium monorepos are equally as terrifying.

It’s much more about coupling and engendering reliance on pre-existing CI constraints, pipeline constraints, etc. If you work in a monorepo set up to assume a certain model of CI and delivery, but you need to innovate a new project that requires a totally different way to approach it, the monorepo kills you.

Another unappreciated problem of monorepos is how they engender monopolicies as well, and humans whose jobs become valuable because of their strict adherence to the single accepted way of doing anything will, naturally, become irrationally resistant to changes that could possibly undermine that.

It’s a snowball effect, and often the veteran engineers who have survived the scars of the monorepo for a while will be its biggest cheerleaders, like some type of Stockholm syndrome, continually misleading management by telling them the monorepo can always keep growing by accretion and will be fine and keep solving every problem, until it starts breaking in colossal failures and people are sitting around confused about why some startup is eating their lunch and capable of much faster innovation cycles.


Oddly enough, you could s/mono/multi in your post and that would exactly align with my own experience. I'm not kidding: everything from engendering reliance on weird homegrown tooling, CI & build pipelines to the pain of trying to break out to a different approach, to enforced bad practices, to developers (unknowingly) misleading management, to colossal failures.

I've worked on teams with monorepos and teams with multiple repos, and so far my experience has been that monorepo development has been better — so much so that I feel (but do not believe) that advocating multiple repositories is professional malpractice.

Why don't I believe that? Because I know that the world is a big place, and that I've only worked at a few places out of the many that exist, and my experience only reflects my experience. So I don't really believe that multiple repositories are malpractice: my emotions no doubt mislead me here.

I suspect that what you & I have seen is not actually dependent on number of repositories, but rather due to some other factor, perhaps team leadership.


Everyone always gives this type of response about everything, though. If you like X, you’ll say, “In my experience you can /s/X/Y and all the criticisms of X are even more damning criticisms of Y!”

All I can say is I’ve had radically the opposite experience across many jobs. All the places that used monorepos had horrible cultures, constant CI / CD fire drills and inability to innovate, to such severe degrees that it caused serious business failures.

Companies with polyrepos did not have magical solutions to every problem, they just did not have to deal with whole classes of problems tied to monorepos, particularly on the side of stalled innovation and central IT dictatorships. Meanwhile, polyrepos did not introduce any serious different classes of problems that a monorepo would have solved more easily.


Absolutely amazing to me how much engineers conflate organizational issues with tooling issues. Let's take a look at one of your comments:

"The last point is not trivial. Lots of people glibly assume you can create monorepo solutions where arbitrary new projects inside the monorepo can be free to use whatever resource provisioning strategy or language or tooling or whatever, but in reality this not true, both because there is implicit bias to rely on the existing tooling (even if it’s not right for the job) and monorepos beget monopolicies where experimentation that violates some monorepo decision can be wholly prevented due to political blockers in the name of the monorepo.

One example that has frustrated me personally is when working on machine learning projects that require complex runtime environments with custom compiled dependencies, GPU settings, etc.

The clear choice for us was to use Docker containers to deliver the built artifacts to the necessary runtime machines, but the whole project was killed when someone from our central IT monorepo tooling team said no. His reasoning was that all the existing model training jobs in our monorepo worked as luigi tasks executed in hadoop.

We tried explaining that our model training was not amenable to a map reduce style calculation, and our plan was for a luigi task to invoke the entrypoint command of the container to initiate a single, non-distributed training process (I have specific expertise in this type of model training, so I know from experience this is an effective solution and that map reduce would not be appropriate).

But it didn’t matter. The monorepo was set up to assume model training compute jobs had to work one way and only one way, and so it set us back months from training a simple model directly relevant to urgent customer product requests."

What do you think is the cause of your woes, the monorepo, or the disagreement between your colleague in central IT tooling who disagreed with you? Where was your manager in this situation? Where was the conversation about whether GPU accelerated ML jobs were worth the additional business value to change the deployment pipeline? Was that a discussion that could not healthily occur? Perhaps because your organization was siloed and so teams compete with each other rather than cooperate? Perhaps because it's undermanaged anarchy masquerading as a meritocracy? Stop me if this sounds too familiar.

I've been there before. I know what it feels like. But, I also know what the root cause is.


Nobody is conflating anything. Culture / sociological issues that happen to frequently co-occur with technology X are valid criticisms of technology X and reasons to avoid it.

To argue otherwise, and draw attention away from the real source of the policy problems (that the monorepo enables the problems) is a bigger problem. It’s definitely some variant of a No True Scotsman fallacy: “no _real_ monorepo implementation would have problems like A, B, C...”.

The practical matter is that where monorepos exist, monopolicies and draconian limitations soon follow. It’s not due to some first principles philosophical property of monorepos vs polyrepos — who cares! — but it’s still just the pragmatic result.

Also you mention,

> “Where was the conversation about whether GPU accelerated ML jobs were worth the additional business value to change the deployment pipeline.”

but this was explicitly part of the product roadmap: my team submitted budgets for the GPU machines, and we used known latency and throughput specs from both internal traffic data and other reference implementations of similar live ML models. Budgeting and planning to confirm it was cost-effective to run on GPU nodes was done way in advance.

The people responsible for killing the project actually did not raise any concern about the cost at all (and in fact they did not have enough expertise in the area of deploying neural network models to be able to say anything about the relative merit of our design or deployment plan).

Instead the decision was purely a policy decision: the code in the monorepo that was used for serving compute tasks just as a matter of policy was not allowed to change to accommodate new ways of doing things. The manager of that team compared it with having language limitations in a monorepo. In his mind, “wanting to deploy using custom Docker containers” was like saying “I don’t want to use a supported language for my next project.”

This type of innovation-killing monopolicy is unique to monorepos.


Hear, hear, yowlingcat. The article is way too prescriptive and, agreed, borders on irresponsible. The monorepo vs polyrepo argument is way too broad a subject to create generalized stereotypes like this. These opinions sadly get taken as facts by impressionable managers, new developers, etc, and have cascading effects on the rest of us in the industry. Use what makes sense for the project environment and team; don't just throw shade at teams who are successfully and productively using monorepos where they make sense. Sure, there is sometimes good reason to split things up along boundaries (breaking out libraries, rpc modules, splitting along dev team boundaries, etc etc etc), but not blindly by default. Will Torvalds split up the kernel into a polyrepo after reading this article? Something tells me that would be a bit disruptive.


It's interesting that you talk about "team's using monorepos". I think that's different than what the article is arguing against, which is an entire company (100+ devs) using a monorepo.

A team with 5 services and a web front-end in a single repo is doable with regular git. It's a different beast I think.


Thanks softawre. What triggered me is the sensationalist title and the general bashing of monorepos (a large percentage of impressionable readers will walk away from this article thinking that monorepos are only for dummies and that you're doing it wrong if you're not using a polyrepo). A less inflammatory title would have been something along the lines of "Having trouble scaling development of a single codebase among hundreds of developers? Consider a polyrepo". This argument comes up in developer shops almost as much as emacs vs vi, tabs vs spaces, etc.

When you have 100+ developers on a project, managing inbound commits/merges/etc will become tedious if they're all committing/merging into one effective codebase.

IMHO, It depends on the project, the team makeup, the codebase's runtime footprint, etc whether or not/or when it makes sense to start breaking it up into smaller fragments, or on the other hand, vacuuming up the fragments into a monorepo.

I did enjoy reading Steve Fink's (from Mozilla) comment (it's the top response on the OP's medium article) and his counterarguments about monorepos vs polyrepos in that ecosystem (also clearly north of 100 developers). It's easy to miss if you don't expand the medium comment section, but very much worth reading.


> A monorepo will serve you 99% of the time until you hit a certain level of scale when you get to worry about whether a monorepo or a polyrepo is actually material

If you worked in a company that had a core product in a repo, and you wanted to create a slack bot for internal use, where would you put the code? I assume not within your core product's codebase, but within a separate repo, thus creating a polyrepo situation.

So when you say a monorepo will serve you in 99% of cases, are you not counting "side" projects, and simply talking about the core product?


This article is too aggressive and has a childish tone that is not to my taste.


My last 2 jobs have been working on developer productivity for 100+ developer organizations. One is a monorepo, one is not. Neither really seems to result in less work, or a better experience. But I've found that your choice just dictates what type of problems you have to solve.

Monorepos are going to be mostly challenges around scaling the org in a single repo.

Polyrepos are going to be mostly challenges with coordination.

But the absolute worst thing to do is not commit to a course of action and have to solve both sets of challenges (eg: having one pretty big repo with 80% of your code, and then the other 20% in a series of smaller repos)


Jesus, this. Look, you're going to run into issues either way, because you're trying to solve a difficult problem.

It's like thinking OOP or functional programming is going to solve all your issues... I mean, in some limited cases they could, but realistically you're just smooshing the difficulties around and hopefully moving them to somewhere where you are more able to deal with them.

FWIW, I've worked in a many-repo org and it sucked worse than huge companies with monorepos and good tooling, but I'm not going to make some blanket statement because it depends on the specifics of your code/release process/developer familiarity etc.


This. Every decision is a trade-off. There is no silver bullet. Context matters.


Sounds reasonable. I'll have to add, though, that the underlying technology factors into this as well.

For example: If you're stuck with a TFS monorepo (you poor soul), you actually get to deal with both problems to some extent, since TFS doesn't enforce that you check out the entire repository at once.

This can lead to very "funny" situations because someone forgot to check out new changes in some folder. OTOH, at least for releases, you can remedy this by using CI everywhere.


Hilariously misguided.

Pretty funny to read that the things I do every day are impossible.

Monorepo and tight coupling are orthogonal issues. Limits on coupling come from the build system, not from the source repository.

Yes, you should assume there is a sophisticated "VFS". What is this "checkout" you speak of? I have no time for that. I am too busy grepping the entire code base, which is apparently not possible.

If the "the realities of build/deploy management at scale are largely identical whether using a monorepo or polyrepo", then why on earth would google invest enormous effort constructing an entire ecosystem around a monorepo? Choices: 1) Google is dumb. 2) Mono and poly are not identical.


> then why on earth would google invest enormous effort constructing an entire ecosystem around a monorepo? Choices: 1) Google is dumb. 2) Mono and poly are not identical.

I think, once you've chosen a path of mono or poly, you have quite a challenge ahead of you to migrate to the other.

At that point, the tradeoffs aren't based purely on the technical benefits - and "invest in monorepo tooling" may become a perfectly valid decision, as it's cheaper than "migrate to a polyrepo setup".

I'm not arguing either way for or against monorepo, just pointing out that "must be a good idea because Google does it" is invalid - technical merit is just one of the thousands of concerns to be balanced.


This makes sense to me. If you're considering a monorepo with millions of lines of code used concurrently by thousands of developers, it's absolutely possible. You just need a handful of developers working on the infra to make it happen.

I do agree with GP though. I wish the author hadn't decided the things I do everyday are impossible.


> You just need a handful of developers working on the infra to make it happen.

With thousands of developers banging on the code base, it's going to be more than "a handful of developers". It's going to be at least a few "handfuls" of developers full time plus probably many, many other full time equivalents spread out throughout the whole user base (testing, supporting other users, etc.).


In reality, it's several hundred developers working full time on infra, and they are all overworked, and gradually falling behind. Monorepos at the scale of Google/Facebook are hard.

We aren't talking about maintaining Mercurial here - we are talking about developing a brand new distributed VCS that happens to be 'Mercurial-compatible', and deploying/maintaining it for tens of thousands of developers working simultaneously.


Development with thousands of developers is hard. The problems with monorepo and polyrepo are only subtly different. Either way you need a fairly large team just to handle the tools you need to solve your problems. Some of your problems will be because of repo organization (again, both choices have downsides that you need custom tooling to solve).

Note that most of your problems will be related to having thousands of developers, and repo organization is irrelevant.


I'd refine this to say "it's a good idea the way Google does it".


3) Google is committed to a monorepo to the point that migrating away from it would be impractical.

Truth is, ending up with a monorepo is _really easy_. It usually starts with something that doesn't even _feel_ like more than one project: backend code, frontend templates and some celery/whatever tasks, maybe some minor utility CLI tools. And this happens at the stage nobody wants to even _think_ about more than one git repository.

Once those are big enough, it's likely too late.

But hey, you can always claim _you wanted it that way_. My cats always look good while pulling that one.


I've worked with both monorepo orgs, and polyrepo orgs. I think if you have only used git as your VCS and not something like perforce, you're likely to get the wrong idea of how it works.

Both CAN work, but for internal organizations with a reasonably sized team, I've come to realize that a monorepo is better. You attain "separation" by establishing different views of the code/data, and at scale the mental model of what's happening is much simpler.


> I think if you have only used git as your VCS and not something like perforce

I think if you worked with Perforce, you're likely to get the wrong idea that people who dislike monorepos didn't work with Perforce. But the reality is that anyone who worked in this industry long enough did at some point end up traumatised by it, thanks.

> the mental model of what's happening is much simpler.

How does introducing the concept of "views" to the VCS model make anything simpler?


It can work. That doesn't mean it is a universal solution. And it doesn't even mean it is a solution that is guaranteed to cover most projects. Whether or not a monorepo works depends on a lot of factors. In my experience the number of cases where it doesn't work appears to outnumber the cases where it works.

It can work nicely when you have disciplined and demonstrably above average programmers that are good at structuring the internal architecture of systems and will know how to design for plasticity. It is also an advantage if all your code is written in the same style and doesn't come from a bunch of older codebases. But even then you can end up with messes that you will be likely to conveniently forget about.

For instance while clear decoupling was a goal when I worked at Google, it wasn't always a reality. There were still lots of very deep and direct dependencies that should never have been there.

It does not work well if you have "average" developers or if you have undisciplined developers or excessive bikeshedders (which kill productivity).

Then there is the tooling. Most people do not work for Google and do not have the ability to spend as much money and time on tooling as Google does. What Google does largely works because of the tooling. It would suck balls without it. To be honest: some things sucked balls even with the tooling. Especially when working with people in different time zones.

Google isn't really a valid example of why a monorepo is a good idea, because your average company isn't going to have a support structure even remotely as huge as Google's. (If you disagree: hey, it's easy, go work for Google for a while and then tell me I'm wrong.)


At some point blaze added the concept of visibility to control dependencies, so teams can whitelist users of their code. Though you can always comment it out while developing locally.


“why on earth would google invest enormous effort constructing an entire ecosystem around a monorepo?”

Didn’t google have a monorepo before git was created? And wasn’t the company founded by academics? Legacy and momentum have a strong influence on the future. Hasn’t google also built a lot of tools for the monorepo, and doesn’t it dedicate employees to it? That’s exactly the issue this article is about.

From an external perspective, the speed and scale of product rollouts from the bigger tech companies is very slow. I don’t know if the tooling has much to do with it, but I suspect it might. I’ve heard some horror stories (some from here) about how it takes months to get small changes into production.


If you think companies like Google are slow, you should look at the enterprise world. I do some technical due diligence work from time to time, and although things have gotten better over the years, it wasn't that long ago that there was a more than 50% chance of seeing companies that only had source code repositories at a team level, and a 20% chance of them not even using a VCS to manage code.

I work for a company now where top management doesn't even understand what a repository is and what role it plays in software development.

Yes, it is that bad in much of the "enterprise" world.

Yes, Google had a monorepo before git was in widespread use. They used Perforce while I was there, which was a miserable, miserable experience. It only worked because they poured engineering effort into making it somewhat tolerable.

I think it would be wrong to say Google chose a monorepo because it was the best choice. To be honest, I don't think they really planned how to deal with many thousands of developers when they made the choice. They just did what seemed to make sense at the time and then had to make it work as the challenges started to mount.


Does Google require more engineers to support their build system than they would with a polyrepo? That question is not trivial to answer, IMO.


Any is more than 0 though. In my experience (probably shared by many devs), polyrepos don’t require a team, or even a single person, dedicated to version control. It’s a minor part of the software management (usually: “mind if I create a new repo for this?” “Yes/no”).

It does affect dependency management but no more than any external dependency.


A polyrepo setup at Google's scale would pretty obviously require some dev work. For example, their CI/build story would be way more complex.


While that may be true, I'm not convinced it is a given. Any complicated enough monorepo requires complex CI/build tools, and Bazel/Blaze exist for a reason ...


> Any complicated enough monorepo requires complex CI/build tools

Any complicated multi-repo setup requires tooling, processes, procedures, cross-repo PRs & issue tracking, &c. &c. &c.

The question is: which requires less cost in order to deliver business value? In my experience, on the teams I've been so far, the answer has been monorepos — but I don't know everything.


At Google scale, you'd either need tooling for automated version bumps, or some other infra to manage versioning.

You'd need cross repo bisection.

You'd need a way to run all tests in all repos reflecting a new change.

There are tens or hundreds more of these I could list.


“You'd need a way to run all tests in all repos reflecting a new change.”

You really shouldn’t have to run every test on every product. Or really any other repos. Use semantic versioning, pin your dependencies, don’t make breaking changes on patch or minor versions.
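For concreteness, a rough sketch of what "pin your dependencies" can look like (this uses the npm "semver" package purely for illustration; the range and versions are made up, not something from the thread):

  // Pin to a compatible range: patch releases of 1.4.x are picked up,
  // but 1.5.0 or 2.0.0 require an explicit, reviewed bump.
  import * as semver from "semver";

  const pinnedRange = "~1.4.0"; // hypothetical pin for some dependency

  for (const candidate of ["1.4.7", "1.5.0", "2.0.0"]) {
    const ok = semver.satisfies(candidate, pinnedRange);
    console.log(`${candidate}: ${ok ? "allowed" : "blocked"}`);
  }

If the library really does follow semver, patch fixes flow to consumers automatically and anything potentially breaking requires a deliberate upgrade.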


Pinning your dependencies is an antipattern (or at least in the eyes of many people who support monorepos it is).

It results in one of three things:

1. People never update their dependencies. This is bad (consider a security issue in a dependency)

2. Client teams are forced to take on the work of updating due to breaking changes in their dependencies. If they don't, we're back at 1.

3. Library teams are forced to backport security updates to N versions that are used across the company.

But really, the question to ask is

>don’t make breaking changes on patch or minor versions

How can you be sure you aren't breaking anyone without running their code? You can be sure you aren't violating your defined APIs, but unless you're perfect, your API isn't, and there are undocumented invariants that you may change. Those break your users. Monorepo says that that was your responsibility, and therefore its your job to help them fix it. Polyrepo says that you don't need to care about that, you can just semver major version bump and be done with it, upgrading be damned.

No semver means that you, not your users, feel the pain of making breaking changes. That's invariably a good thing.


At AMZN, which has 1000s of separate repos, 1) was the general case, with 2) occurring whenever there was a critical security issue in some library that no one had updated for years. The resulting fire drill of beating transitive dependencies into submission could occupy days or weeks of dev time.


When you change a project, you have to test the effect of your changes on downstream dependencies. Semver is wishful thinking. Even a change that _you_ think is non-breaking could break something. Saying "well, the downstream team was using undocumented behaviour so it's their fault" doesn't really hold much water when your team is Driving Directions and the downstream team is Google Maps Frontend and your change caused a production outage.


The nice thing about polyrepo is that each repo doesn't care what the others are doing. They can use whatever tools they prefer (whatever is least work and most familiar for the team, perhaps). It might also encourage better documentation and adherence to good practices with regard to deprecation and maintenance.


> I don’t know if the tooling has much to do with it, but I suspect it might.

As a development team grows, time to market also grows, in a superlinear fashion. This is known since people shared code on dead-tree pages, so the odds of tooling being the cause are low.


3) A monorepo with significant investment in ecosystem and tooling is a better choice than a polyrepo

For other (smaller) companies, polyrepo might be the better choice because [significant investment in ecosystem and tooling] is not appealing, and the investments of Google et al. have not leaked through sufficiently into general available tools. Some headway is being made in the latter [1], so monorepo might be the "obvious" best choice in 10 years or so.

[1] For example, Git large file support is mostly from corporate contributors https://git-lfs.github.com/ https://github.com/Microsoft/VFSForGit


> For other (smaller) companies, polyrepo might be the better choice because [significant investment in ecosystem and tooling] is not appealing

That's not the choice, though: significant investment in tooling is a function of codebase size. In my own experience, polyrepos require more tooling, because you're not just dealing with files & directories, you're also dealing with repos (& probably PRs & issues & other stuff in a forge).


> In my own experience, polyrepos require more tooling

That's not my experience. In my experience, polyrepos significantly reduce complexity for a medium-sized (30 developers) project.

An example: the following things are good software development practices if you work with a master-PR branch model:

  1. Tests must pass on CI before merging a branch to master
  2. Before merging a branch into master, the latest master must be merged into the branch so that tests are still reliable
This quickly becomes untenable if 30 people all commit to the same repo. By the time your PR is reviewed, it's outdated. So you merge master into your branch. By the time you come back to check your test results and merge, it's outdated again. Repeat until 6 PM.

So you need partial builds to keep build time low, and would probably like to amend 1 & 2 with "unless your code has zero overlap with the changes in master". These are not standard features of any CI system I know of, hence the need for tooling.
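To make that concrete, here's roughly the kind of check I mean -- a hypothetical helper (names invented, not a standard feature of any CI system) that asks whether a PR branch and master have touched any of the same files since the branch point:

  // Sketch: only force the "re-merge master, re-run CI" dance when the PR's
  // changed files overlap with files changed on master since the merge base.
  import { execFileSync } from "child_process";

  function git(...args: string[]): string {
    return execFileSync("git", args, { encoding: "utf8" }).trim();
  }

  function changedFiles(range: string): Set<string> {
    return new Set(git("diff", "--name-only", range).split("\n").filter(Boolean));
  }

  function needsRetest(branch: string, master = "origin/master"): boolean {
    const base = git("merge-base", master, branch);          // common ancestor
    const branchFiles = changedFiles(`${base}..${branch}`);  // touched by the PR
    const masterFiles = changedFiles(`${base}..${master}`);  // landed on master since
    return [...branchFiles].some((f) => masterFiles.has(f));
  }

  console.log(needsRetest("HEAD") ? "overlap: re-merge and re-run" : "no overlap: merge away");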

Instead of tooling, polyrepos provide the above benefits out-of-the-box. Just set your CI to build the repo, and it will do partial builds, and PR merging is uninfluenced by other repos. This is a huge advantage over monorepos.

The downside is that if your repos have tight coupling, you'll need simultaneous PRs in more than one place, or need to look up history/files in more than one repo. If this is more than a rare occurrence, this downside is so large that polyrepo is not a suitable solution for your project.

The projects of this size I've worked with did not have this problem, or the problem was solvable without much difficulty.


Another thing is that the OP categorized "medium-sized monorepos" as being too large to fit on a laptop. Maybe the companies I've worked for are a limited sample (mainly startups & consultancies), but the entire codebase, after several years of operation, still easily fits on a single laptop.

Of course there is a threshold, however this is typically a concern of a large organization or an organization that has been producing software for a decade or more.


> As described above, at scale, a developer will not be able to easily edit or search the entirety of the codebase on their local machine. Thus, the idea that one can clone all of the code and simply do a grep/replace is not trivial in practice.

Yeah this is a pretty widespread and fundamental misunderstanding that leads to a lot of bad policy decisions.

If 'grepping code' is your first resort then you're hitting things with a hammer. I'm writing code that a machine is supposed to understand. If the machine can't understand how the bits interact then I have much bigger problems than where my code is stored. Probably we're dealing with a lot of toxic machismo bullshit that is hurting our ability to deliver.

If you want discipline, if you want cooperation, hell if you just want to be able to hire a bunch of new people when you land a big customer, you need some form of support for static analysis and the code navigation that it enables. Stop the propeller heads from using magic and runtime inference to wire up the parts of the system, or find a new gig. Even languages where static typing isn't a thing have documentation frameworks where you can provide hints that your IDE can understand (ex: jsdoc for Javascript).

For a large team, working without any kind of static analysis is a recipe for a rigid oligarchy. Only people who have memorized the system can reason about it. Everybody else who tries to make ambitious changes ends up breaking something. See what happens when you trust new people with new ideas? New is bad. Be safe with us.

And even if by some miracle you do make the change without blowing stuff up, you're still in the doghouse, because we have memorized the old way and you are disrupting things!

Some crazy ideas work well. Some reasonable ideas fail horribly. To grow, people need the space to tinker and an opaque codebase ruins those opportunities. Transparency is also helpful when debugging a production issue, because people can work in parallel to the people most likely to solve the problem (even the person who is usually right is way off base occasionally). I should be able to learn and possibly contribute without jamming up the rest of the team by asking inane questions.

You need pretty good but entirely achievable tooling and architecture to get that, but man when you do it's like getting over a cold and remembering what breathing feels like.


At least the author gave us the courtesy of italicizing his broken assumption from the outset of the post.

> Because, at scale, a monorepo must solve every problem that a polyrepo must solve, with the downside of encouraging tight coupling, and the additional herculean effort of tackling VCS scalability.

Right.

But you have to get to "scale" first (as it relates to VCSs). Most companies don't. Even if they're successful. Introducing polyrepos front loads the scaling problems for no reason whatsoever. A giant waste of time.

Checkmate! I didn't even need a snarky poll. The irony of that poll is that it clearly demonstrates his zealotry, not other people's.


The author talks about proponents of monorepos, but I thought when I read it: actually they are victims of monorepos trying to explain to themselves as much as anyone why they choose to suffer with them. (Actual reason: for $$$).

Nobody would choose to drag around every historical afterthought in the development sequence of long forgotten software going back three decades that no longer builds with current tools, just so they can work on a small library off in a corner. Software is getting written and added to these monorepos at a much faster rate than hardware and networks are able to hide the bloat-upon-bloat growth of them.


>Nobody would choose to drag around every historical afterthought in the development sequence of long forgotten software going back three decades that no longer builds with current tools, just so they can work on a small library off in a corner.

If it doesn't work then it should be deleted. If it's still running somewhere then it should be maintained. Presumably you have a CI system so the monorepo actually requires everything in it to build.

In my experience, it's polyrepos that allow for dead and un-maintained code to just sit there for eternity. You forget about that unused repo right until the moment the service it deploys to (if you can track down that dependency) needs an update or goes down. Monorepos can more easily force system wide CI that checks for broken dependencies or other issues.


That old code worked 30 years ago... We have a closet in our office with a computer running Windows XP with no service packs, and whatever compiler was used to build at the time. If we ever have to release an update for whatever we were shipping, we can do so. (assuming that computer still boots after all these years...)

In the embedded world supporting software for 30 years is not unheard of. We avoid it, but it is in the back of our mind that someday we might have to release an update. Fortunately 30 years ago nothing was internet connected, we are worried that we might be releasing security updates for our current products 50 years from now...


Bloat and the failure to remove broken/bad legacy code have nothing to do with monorepos.

"Hey, what does this server do?"

"No idea; it hasn't been touched in years. What's deployed to it?"

"Some 'foobar-ng' thing, never heard of that. Says it was last updated 5 years ago. Pull up our source repo for that package, will you?"

"Hang on, we've got like 30 services with names containing 'foobar', let me find that one . . . oh god. You don't wanna know."

"Fuck it, I'm just gonna shut this server down and remove that ancient, dead, busted package."

"The main billing system just broke! What did you do?!"


You can't split monorepos after the fact, at least not without immense costs. You can always just put all your small repos into a big one.


> You can't split monorepos after the fact, at least not without immense costs.

That has not been my experience at all. At a previous employer, we did exactly that with a multi-language library. In fairness, having multiple languages enforced fairly good directory structure in our single repo. But isn’t that the real point: good structure makes life easier, period. The thing is, going into a project you often don’t know what the right structures are yet. Creating a new repo for each component you think you need ossifies those choices, making it far more difficult to walk back on them later on (first because you may not even see the architectural mistake, second because the maintainers of that component will have an investment in its existence).

> You can always just put all your small repos into a big one.

In my experience, that’s harder, precisely because over time so much tooling has been built into each repo to manage builds, images, deployment &c.

I’ve worked in monorepos & I’ve worked in multirepos, and so far my experience has been that monorepos enable faster velocity and more-maintainable software. I’ve not (yet) worked at Google- or Facebook-scale, though, and I’m completely open to the idea that at that scale a team really does need lots and lots of repos, and tooling to stitch them all together.


I think there's a nuance to this that should be pointed out: Monorepos allow you to do very bad hacks (I need this other component over there; let me just put in a Symlink. Done.). And if people can, they will use those hacks.

If you split your repo up from the get-go, the worst thing that can happen is that you'll have to assemble multiple distinct, well-encapsulated (in terms of project structure) things into one. In Git, that could lead to multiple root commits, but that's about it.


No. The worst case is that the engineering team spent more time working on “well encapsulated projects” than on the most important project for their business and are all now out of jobs. Most companies don’t fail because of tech debt. And certainly not because of version control tech debt.


Exactly. Whenever I see an engineer take a hardline position (eg: "no monorepos you zealots!") I always ask myself: is this person just annoyed?

Most of the time they're just annoyed.

One side effect of every successful business are annoyed worker ants that are sick of dealing with growth problems. I've been there. I know how annoying it can be.

Personally I've found comfort in embracing the chaos and learning to manage it responsibly. No dogma. No absolutes. Know how to do monorepos well. Know how to do polyrepos well. Learn the pitfalls of both. Don't assume other people are stupid zealots.


I agree. Any article (like the linked one) that states one side of a case as an absolute without giving any exceptions or caveats is going to be greeted by me with scepticism. Particularly as he keeps mentioning 3 large engineering organisations that disagree with him.


Not exactly. (At least small) companies can go out of business because of bugs. And one great way to "achieve" said bugs is implicit dependencies hidden from the developers that didn't introduce them.

> The worst case is that the engineering team spent more time working on “well encapsulated projects” than on the most important project for their business

I'm not really sure how I should read this. Don't you use your repos to solve business problems? Why should that change because of the repo layout?


If you do a poly-repo approach from the start, and have dependencies between repos, you need to introduce component versioning from the start. Component versioning doesn't solve any business problems, but requires engineering effort.


Small businesses are not going to go out of business because of bugs unless those bugs aren't addressed. They'll go out of business because of poor sales and product management. Different things.


> Monorepos allow you to do very bad hacks (I need this other component over there; let me just put in a Symlink. Done.).

I've seen that with polyrepos as well: The entire project would require you to clone the individual repos into a specific directory structure so that things would work (no, not even submodules).


> Monorepos allow you to do very bad hacks (I need this other component over there; let me just put in a Symlink. Done.).

Why would you put in a symlink? You could just provide a path to the actual component and import it into your project.

> the worst thing you can get that you'll have to assemble multiple distinct, well-encapsulated (in terms of project structure) things into one

When you have multiple repos, you also have multiple versions and releases of things. Now every team has the following options:

1. Backport critical fixes to every version still in use (hard to scale)

2. Publish a deprecation policy, aggressively deprecate older releases, and ignore pleading and begging from any team that fails to upgrade (infeasible - there'll always be a team that can't upgrade at that moment for some reason)

You also have to solve the conflicting diamond dependency problem. This is when libfoo and libbar both depend on different versions of libbarfoodep. It's even more fun in Java because everything compiles and builds, but fails at runtime. So now you have to add rules and checks to your entire dependency tree - some package managers have this (Maven), others don't (npm IIRC).
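The diamond case is easy to show with version ranges (library names are from my example above; the ranges are invented, and the npm "semver" package is used only to illustrate):

  // libfoo wants libbarfoodep 1.x, libbar wants libbarfoodep 2.x.
  // If the ranges don't intersect, no single version satisfies both,
  // and a flat dependency tree has an unsolvable conflict.
  import * as semver from "semver";

  const wantedByLibfoo = ">=1.0.0 <2.0.0";
  const wantedByLibbar = ">=2.0.0";

  if (!semver.intersects(wantedByLibfoo, wantedByLibbar)) {
    console.log("diamond conflict: no libbarfoodep version satisfies both ranges");
  }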


> Why would you put in a symlink? You could just provide a path to the actual component and import it into your project.

Where do I need to put the path again? Ah what the heck, I'll just add a symlink inside a folder that's already somewhere in the build definitions.


What language and build tool is this that you're using?

I don't know anyone who has abused Maven or Cargo or Go like this. And I don't imagine Visual Studio Solutions for C# are used like this.

Is there an underlying disagreement here based on JS/Ruby/Python scriptish coding (which creaks when a lot of developers work on it) vs C and C++ (which have astonishingly bad build system stories) vs big-iron languages that don't sweat when in a monorepo?


> And I don't imagine Visual Studio Solutions for C# are used like this.

At my workplace, we've just been cleaning up a whole bunch of instances of exactly that anti-pattern. Except that it's obviously not symlinks (which require specific user rights on Windows), but links to external files in VS.

Same problem, though: They're easy to introduce and a pain to deal with later on.


I have never done such a hack myself, but I've seen it. Mostly in C++ projects ;)


Could a VCS simply blacklist symlink files? Plus if you have developers doing crap like this (and their colleagues letting it slide in code reviews) you have problems that can't be solved by monorepo vs polyrepo. You have an engineering culture problem.


No VCS will ever protect you from crappy code though.


That's true, but we should at least make it as hard as possible to write crappy code ;)


> You can always just put all your small repos into a big one.

It's not quite as simple as that. You'll need to avoid rebuilding the entire repo for every change - using something like Bazel. This means your build tooling has to be replaced entirely, which is a non-trivial task and not something your devops/release engineering team will thank you for.

For any 3rd-party libraries used by your projects, you need to either ingest those projects into your monorepo and update them forever, or keep npm/maven/pip/gem/whatever around just for managing 3rd-party dependencies (plus whatever system you use to front the main language package registries, because of course you're not talking directly to npm or Maven Central, are you? What if they go down or do a leftpad?).

I think either system - monorepos or polyrepos - works fine; just pick one and stick with it. Monorepos will probably give you better velocity starting out. Past a certain size, which most software shops will never hit, the already-available tools lend themselves better to polyrepo. And more devs know polyrepo tools (eg. Jenkins) than the corresponding monorepo solution (eg. Bazel). Things might swing in favor of monorepo on the VCS side if Twitter/Google/FB ever open-source their stuff.


Eh, if your small repos build separately before you put them into one big repository, they'll build separately after, even if you just have a Makefile in each. If building your software depended on building its parts first, you already have the tooling to do so.


This really does not work for all languages. Speaking from personal experience, trying to monorepo multiple JS projects managed by NPM, it's more work than that.


I know neither JS nor NPM. Could you elaborate a bit? What's the difference between having your code in folders A B C, each their own repo, and having your code in folders A B C, subfolders of some monorepo? Does NPM try to do something clever with your VCS?


>not something your devops/release engineering team will thank you for.

It's their job. If they actively don't want to do work, then you probably made a hiring mistake somewhere. By that logic, what DevOps really wants is for the company to shut down, since then they'd have none of that tedious work to do.


Release engineering is a conservative profession. They won't thank you for upsetting all their established processes, introducing new software and systems that have to be maintained, and kept running just because you don't like how your code is laid out.

It's harder to make a business value case for this type of change - there are only vaguely worded promises of "improved developer velocity". Contrast that with a change that automates or makes faster some aspect of building and releasing - a professional release engineering team would be all over it because they can demonstrate value in that work.

In any company beyond 50 people, there are multiple engineering managers, directors or the VP of engineering that will need to back this initiative to make the release engineering team do it. It's really not as simple as "dump all the code in one repo". I'm speaking from experience.


They will have to work the entire Christmas break to pull off the switch. That will not make them happy. Rolling out massive changes like this is not easy; it needs to be planned in advance, tools built, dry runs run, and then the final move made. It also requires management to not schedule any releases for several weeks before or after the change. As a member of a tools team, I wouldn't think of doing this level of change while other people are in the office, which means I miss my Christmas holiday.

Sure, it is their job, but it isn't an easy job and there are many opportunities for things to go wrong. It might or might not be the right choice for you, but don't overlook how hard it is.

Note that the above applies for going in either direction.


You're making a lot of assumptions about how such a move would be done which I don't feel are warranted. You're picking the hardest most painful option and then using it to claim the process is painful rather than that the option you chose is painful.

If I was moving many small repos into a single mono repo then I'd do it one repo at a time. Presumably your small repos are independent entities so there's no reason to do a single massive switch. Transition each repo to the new build system inside the existing repo. Once that works then you can transition that repo into the mono-repo and tie together the build systems. No need to stop releases, no massive chance of everything failing, no weeks of debugging while the world is stopped, etc. Rinse and repeat until everything is moved over. Process becomes more optimized and less error prone with each repo that is moved over.
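For what it's worth, the per-repo move itself can be small. A rough sketch (repo URL and target directory are made up; assumes git-subtree is available):

  // Pull one existing repo into a subdirectory of the monorepo, keeping its
  // history, then wire it into the monorepo build before archiving the old repo.
  import { execFileSync } from "child_process";

  function git(...args: string[]): void {
    execFileSync("git", args, { stdio: "inherit" });
  }

  const repoUrl = "git@example.com:payments-service.git"; // hypothetical source repo
  const targetDir = "services/payments";                  // hypothetical home in the monorepo

  // `git subtree add` imports the other repo's commits under targetDir.
  git("subtree", "add", `--prefix=${targetDir}`, repoUrl, "master");

Keep the old repo read-only until CI is green in the monorepo, then archive it and move on to the next one.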


Now I lose lots of weekends making each separate move.


Why weekends? Small moves means you do it during the week and during regular working hours. Done in branches with CI support so that it's pretty unlikely to break anything. If doing regular releases require you to be spending your weekends then you have bigger issues to fix first. Spread it over however long it takes. Why are you trying to make your life more difficult?


I have done this. Regardless of when you do the move everybody who works in the repo being moved is down for several hours at best if you do it during working hours.


It's funny you say that, because the currently top-voted comment says exactly the opposite: https://news.ycombinator.com/item?id=18811368


> You can't split monorepos after the fact, at least not without immense costs.

Sure you can. The difficulty of doing so depends on many (many) factors. If your team does their job well then the costs won't be immense. It might be annoying, but not that hard.

Speaking in absolutes or platitudes solves nothing. Sometimes monorepos make sense. Sometimes polyrepos make sense. It's entirely dependent on what your company does.


Of course, if everybody is very diligent in keeping things in the monorepo distinct and independent, then it's easy to split it later on. But relying on constant diligence doesn't work out in the long run in my experience.


It's not like polyrepos solve the long term diligence problem though.

E.g. if you rely solely on a package manager to keep coupled things in lock step, you need to make sure that version numbers are kept up to date for every little change made to every library.

You can easily end up with a situation where someone in another team makes a small change but doesn't change the lib version number. That's a people issue but it does happen.

You can get round that by using a repo SHA, but now you have two things to keep up to date for every library.

Likewise, you'll have to be diligent in versioning APIs. Anecdotally, I've found it easier to keep things in lock step when in a single repo and using a single pull request for each story than where separate teams have to keep separate repos in sync.

Both work, but the monorepo approach worked better for the projects I've worked on. It just led to fewer moving parts and more repeatability when there's a single SHA to watch.

I have also been lucky enough that I haven't worked on a project so large that we couldn't build a monorepo on a single machine with "normal" build tooling.


There's a lot wrong with this article. Most of the arguments are either not backed up or are misleading. I haven't heard anyone argue they can drop dependency management because of a monorepo.

The author lists downsides of monorepos without listing the upsides and downsides of polyrepos, so it's really only half complete.

I don't think anyone who likes a monorepo is suggesting you just commit breaking changes to master and ignore downstream teams. What it does do is give the ability to see who those downstream teams (if any) might be.

The crux of the author's argument is that added information is harmful because you might use it wrong. It's just as easy (far easier, in fact) to ignore your partners without the information a monorepo gives. It's not really an argument at all. There's really nothing here but "there be dragons".

Monorepos provide some cross-functional information for a maintenance price. It's up to you whether the benefit is worth the overhead.


"... Please don't" titles also give off a condescending vibe, which usually means the author has erected strawmen, is appealing to emotion, & has not thought things through.


Seems like the main point is that you'll still need to add additional tooling (search, local cloning, build, etc.) to handle scaling, something you can do just as well with polyrepos. Conversely, for polyrepos, you can add tooling to fix issues with dependency management and multi-project changes/reviews. However, the author figures that monorepos encourage bad code culture, and points out that Git is hard to build a monorepo on.

To me this message seems a bit shallow: of course we can build tooling to hide the fact that we have a polyrepo. Given well-enough-built tooling and a consistent enough polyrepo structure (all using the same VCS, all being linked from common tooling, following common coding standards and using the same build tooling, etc.), the distinction from having a monorepo is more of an implementation detail.

Given the choice between a consistent monorepo where everyone is running everything at HEAD and a polyrepo where each project has its own rules and there's no tooling to make a multi-project atomic change, I'd go for the former.

Given the choice between identical working environments but different underlying implementations I would go for whatever the tools team think is easier to maintain.


What is the tooling for multi-repo atomic synchronized commits? Monorepos give you that for free, which is the reason why I think monorepo projects exist. SVN kind of gave you partial checkouts, which was helpful.


The polyrepo approach treats that as a non-feature and doesn't give it to you. You can figure out where things are, but you never get synchronization.

This is a good thing because when you have to make the multi-repo commit, you make the change and then update each downstream one at a time. Each change is much smaller and so easier to review (and also easier to find the right reviewer for).

Of course, the downside is you either have to maintain both ABIs (not just APIs), have a rollout scheme with two versions of the upstream library existing side by side, or don't release.

Nothing is perfect.


Yes, I think so too. But of course, as the article points out, nothing is entirely free. At some point we will have to build tools to handle scaling, and then the trade-offs between a monorepo and a polyrepo become less obvious. I'd lean towards monorepos as a base either way, but given sufficiently well-working tooling, it might not matter much.


> Given well enough built tooling and consistent enough polyrepo structure (all using same VCS, all being linked from common tooling, following common coding standards and using the same build tooling, etc.) the distinction from having a monorepo is more of an implementation detail.

Exactly. Sure, you can manually recreate a monorepo from a multirepo system, but … why do that? That takes software engineering effort that you could spend on your product instead.


I’ve found monorepos to be extremely valuable in an immature, high-churn codebase.

Need to change a function signature or interface? Cool, global find & replace.

At some point monorepos outgrow their usefulness. The sheer number of files in something that's 10K+ LOC (not that large, I know) warrants breaking the codebase apart into packages.

Still, I almost err on the side of monorepos because of the convenience that editors like vscode offer: autocomplete, auto-updating imports, etc.


Monorepos and packages are not mutually exclusive. You can and should have many different projects in subfolders of your monorepo, each with their own builds and tests and artifacts (though hopefully somewhat standardized). The point is that now it's easy to release changes across multiple projects, integration test between them on a specific global patch, etc., without a whole pile of complex tooling.


Agreed. When I wrote the parent comment, I was thinking of a time I prematurely abstracted an API wrapper to a private git repo and how painful simple, frequent changes were.

Though, as you say and I commented below they’re not mutually exclusive. A wrapper or even an entirely separate service can exist alongside others.

One dark side of this is being able to “reach inside” other parts of the monorepo and blur application boundaries.


This guy gets it. Software Engineering is about using the appropriate tools and techniques for the task at hand. If your repo gets so large it can't be comfortably checked out, something needs to get split apart.

Monorepos are also a great technique for tackling large legacy codebases. When the rot is all in one designated place, it becomes easier to encourage good developer habits on new code created in new, separated repo(s).

Speaking from experience, I've worked on a team operating through a monorepo project that came out really well. The codebase was mostly golang, so everything lived in the GOPATH, but for the most part the typescript on the UI side of the repo didn't complain. Testing and code quality were a higher priority as well, which may have contributed to its success.

I have also worked on a monorepo project that had minimal tests and automation, which soon grew monstrous and ultimately needed refactoring. That was a big pile of coffeescript, es6 and java that was ultimately refactored into three different node modules and two microservices.

Javascript and its module packaging tends to conform better to polyrepo patterns. golang code all wants to be in the same place, and java repos have their own desired nested directory structures. These two languages tend to encourage monorepo design patterns.

Monorepo or Polyrepo, the correct answer is whatever works for your team and task at hand.


Hold on, are we talking about monorepos, ie a set of projects with shared change history (and possibly 'build it all' type tooling) or single monolithic apps?

I'm seeing these two things conflated in this thread.


To me, a monorepo consists of a set of related or semi-related services or runtimes that can operate autonomously, but have a dependency on their siblings to operate correctly.

In some cases, this could be two separate backend projects where you want to re-use the same deployment pipeline.

Often, I find that API wrappers are something that I share across frontends and backends in the JS world, so it often makes sense to separate my projects into:

- backend

- frontend

- common

In Typescript I really like this pattern and can namespace shared types so that it’s very clear to the future reader that this type is probably used outside of the current context.
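As a sketch of what I mean (file names and types invented):

  // common/api.ts -- shared wire types live in one place and are imported by
  // both backend/ and frontend/. The namespace makes it obvious to a future
  // reader that these types cross project boundaries.
  export namespace Api {
    export interface User {
      id: string;
      displayName: string;
    }
    export interface GetUserResponse {
      user: User;
    }
  }

  // elsewhere, e.g. frontend/userClient.ts:
  //   import { Api } from "../common/api";
  //   async function fetchUser(id: string): Promise<Api.User> { ... }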

So, to reply to your comment — I think the term “monorepo” can encompass a lot of different project types.

I think Dan Luu covers the bases quite well here:

https://danluu.com/monorepo/


In fairness, a single repo does encourage a monolithic architecture (even though one can have multiple binaries inside a single repo), just as a monolithic app does encourage a lack of modularity (even though one can write a single app composed of well-chosen modules).


The biggest gripe I have with modern day monorepos is that people are trying to use Git to interact with them, which doesn't make a tremendous amount of sense, and results in either an immense amount of pain and/or the creation of a bunch of tools to try to coerce Git into behaving basically like SVN.

Which of course raises the question: rather than trying to perform a bunch of unnatural acts, why not just use SVN to start with? It works extremely well with monorepo & subtree workflows.

Sure it has some warts in a few dimensions around branching, versioning, etc. compared to Git when using Git in ways aligned with how Git wants to work, but those warts are minimal in comparison to what's required to pretzel Git monorepos into scaling effectively.


Maybe it's just that the author's cutoff is at the wrong team size, but the monorepo I work on (with ~150 devs) has almost none of the problems presented.

Unreasonable for a single dev to have the entire repo? I'm looking at a repo with ~10 million LoC and ~1.4 million commits. I have 74 different branches checked out right now. Hard drives are cheap.

Code refactors are impossible? I reviewed two of those this morning. They're essentially a non-event. I'm not sure what to make of the merge issue - does code review have to start over after a merge? That seems like a deep issue in your code review process. The service-oriented point seems like a non-sequitur, unless you're telling me I'm supposed to have a service for, say, my queue implementation or time library.

The VCS scalability issue is the only real downside I see here. And it is real, but it also seems worth it. It helps that the big players are paving the way here - Facebook's contributions to the scalability of Mercurial have definitely made a difference for us.


In theory, yes - if the underlying repo changes, code review should start over. In practice though, it's a terrible idea ;)

Part of code review is to ensure the code "fits" with all other merged code - so a re-review is "needed" when other changes merge. E.g. if I merge a refactor that changes everything from Pascal case to 100% SHOUTING, reviews now need to take this into account.

In practice, this doesn't happen - it's way too much effort for far too little value.


I think the trick is to only re-review the areas that had merge conflicts, and to do the re-review aware of both the changes you already reviewed and the changes that caused the conflict. Merge conflicts, even in big code refactors, are fairly rare, so this ends up not being much additional work in practice.


> if I merge a refactor that changes everything from Pascal case to 100% SHOUTING, reviews now need to take this into account.

To be fair, if you get away with merging that refactor, the review that needed more attention was of that refactor ;-)


I do really like mono-repos, but Google's other significant new project, Fuchsia, is set up as a multi-git repo (and I believe Chromium is too, maybe Android - haven't checked). For Fuchsia they use a tool called "jiri" [1] to update the repos; previously (and maybe still in use) there was the "gclient" sync tool [2] from depot_tools [3].

[1] - https://fuchsia.googlesource.com/jiri/ [2] - https://chromium.googlesource.com/chromium/tools/depot_tools... [3] - https://chromium.googlesource.com/chromium/tools/depot_tools...

It even shows a bit in the build system of choice: GN (used in the above, previously gyp) feels similar on the surface to Bazel, but has some significant differences (GN has some more imperative parts and is a Ninja build-file generator, while Bazel, like Pants/Buck/please.build, is a build system on its own).

Simply fascinated :), and can't wait to see what the resolution of all this would be... Bazel is getting there to support monorepos (through WORKSPACEs), but there are some hard problems there...


Having worked with some organisations building on Android (>1,000 repos), I can say life is not easy when you are trying to build on top of it and regularly take updates etc.

I asked one company how many changes required changes to more than one repo and was told "a small percentage". We then did some basic analysis of issue IDs across commits and discovered that it was in reality nearer 30% of changes. Keeping those together was just plain very hard.

Start to scale this by teams of hundreds or thousands of devs and you get a lot of pain.

Managing branches is also hard - easy to create (with repo tool) - but hard to track changes.


Funny that there are so many reimplementations of git submodules but with support for "just give me HEAD" - Google has two (jiri and repo), my company has a home-grown one too.


Android uses a top level repo that behaves as a monorepo with thousands of submodules inside. It's also designed for multiple companies sharing code and working with non shared code at the same time which introduces some constraints and challenges.


"Bazel is getting there to support monorepos" -> support multiple repos (was tired when I wrote this I guess)...


My polyrepo cautionary tale: Two repos, one for fooclient, one for fooserver, talking to each other over protocol. Fooserver can do scary dangerous permanent things to company server instances, of which there are thousands.

Fooserver sprouts a query syntax ("just do this for test servers A and B"), pushed to production. Fooclient sprouts code that relies on this, pushed to production. A bit later, Fooserver is rolled back, blowing away query syntax, pushed to production. "Just do this for test servers A and B" now becomes "Do this for every server in the company". Hilarity ensues.


Ouch. I suppose the lesson is that a monorepo with both client and server being developed and tested together would have reduced such risk.


Versioning the client/server interface would've also reduced such risk.
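For example (a sketch only, with made-up names): if the client always sends the protocol/query version it was built against, the server can refuse requests it doesn't understand instead of silently reinterpreting them after a rollback.

    // server side — reject unknown versions rather than guessing
    const SUPPORTED_QUERY_VERSIONS = new Set([1, 2]);

    interface FooRequest {
      queryVersion?: number;
      query: string;
    }

    function handleRequest(req: FooRequest): void {
      if (req.queryVersion === undefined || !SUPPORTED_QUERY_VERSIONS.has(req.queryVersion)) {
        throw new Error(`unsupported query version: ${req.queryVersion}`);
      }
      // only interpret req.query once we know we speak this dialect
    }

In the rollback scenario above, the old Fooserver would then have rejected the new Fooclient's scoped query outright instead of quietly running it against every server.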


Are there any examples of someone who actually maintained a monorepo for a massive company who now says they shouldn't? It always seems to be "back seat drivers" against monorepos, not people with practical experience (that I can see at least).


I call bullshit on "our repository is too big for one machine".

Seriously, you have over 1 TB of code and 100 people wrote it?


Adding raw versions of binary assets (designs, video, ...) can quickly lift a repo beyond a TB. Now, you could say "don't do that", but there are valid use cases where you'd want to track all binary assets as part of the development cycle.


Ouch, well, yes that is a very good situation in which not to take the "mono" part of "monorepo" too seriously.


or use a VCS that allows partial checkouts of repositories. There's no DVCS that I know of that can do that, but for example SVN can. Git LFS might be an option, too. There are also commercial products that target that market.

I just wanted to point out that reaching a measly TB of data doesn't require much effort. (I worked on a product that versioned rendered clips for special-effects production.)


You can do that with content, you just partition the workspace/view of the monorepo to what each person needs rather than checking the entire thing out git style.



You can have large repos where it's not only code. I remember seeing repos of many tens of gigs because all of VS was versioned as well for "reproducibility".


In 2016, Google's monorepo was 86TB


The keyword there is "Google". Everything is different at the extremes.


Better title would be "Monorepos don't fit with my particular use case."


I strongly agree. I hate this style of blog post.

Telling people what they should or should not do is generally absurd. Every situation is unique and you can't possibly know another project's requirements or acceptable trade-offs.

A better approach, in my opinion, is "Here's what we did and why". The author clearly has experience in the area. Great! Tell me about your problems. Tell me about your attempted solutions and what did or did not work. Tell me what you wish you had done! I'd love to use knowledge of your situation to inform my own decision making.

But don't be surprised if my circumstances are different and lead me to prefer different trade-offs and choose a different solution. That doesn't make me a zealot or an idiot.


Isn't your own post telling people what they should and should not do (specifically on how to give advice)?


The irony wasn't lost on me. It's a fine line. Let me try a slightly different approach.

When I blog I've had much better luck telling people "here's what I did and why". I don't know your circumstances and can't tell you how to solve your problems. You may need to choose different trade-offs than I did. With that said, here is my problem, how I solved it, and what I learned along the way. Hopefully you can learn from my experiences and make a more informed decision for how to handle problems you may encounter.


I disagree. The thesis statement repeated several times throughout is "Monorepos don't scale in exactly the same ways that polyrepos don't scale, the tools to solve the scaling problems are the exact same except monorepos need more of them and encourage bad habits along the way."

You may disagree with that thesis, but it definitely seems to cover more than one use case.


...the author's title is literally "monorepos: please don't!"


To me, the key point is this: Splitting your code into multiple repos draws a permanent architectural boundary, and it's done at the start of a project (when you know the least about the right solution).

The upsides and downsides of this are an interesting debate, but there is a cost to polyrepos if you want to change the system architecture. There is a cost to monorepos too, as argued by this post, and it's up to the tech leads as to which cost is greater.


"The frank reality is that, at scale, how well an organization does with code sharing, collaboration, tight coupling, etc. is a direct result of engineering culture and leadership, and has nothing to do with whether a monorepo or a polyrepo is used. The two solutions end up looking identical to the developer. In the face of this, why use a monorepo in the first place?"

.....because, as the author directly stated, the type of repo has nothing to do with the product being successful. So stop bikeshedding, pick a model, and get on with the real business of delivering a successful product.


Could you get the best of both worlds by having a monorepo of submodules? Code would live in separate repos, but references would be declared in the monorepo. Checkins and rollbacks to the monorepo would trigger CI.
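Roughly what I have in mind (a sketch; repo names made up): the "monorepo" becomes a thin meta-repo whose only contents are pinned submodule references, so a commit to it is effectively an atomic version bump across components.

    # declare the components once
    git submodule add https://example.com/org/backend.git backend
    git submodule add https://example.com/org/frontend.git frontend
    git commit -m "Pin backend and frontend at known-good revisions"

    # CI / developers check out exactly the pinned revisions
    git clone --recurse-submodules https://example.com/org/meta.git

    # a rollback is just reverting the meta-repo commit that moved the pins
    git revert <sha-of-pin-bump>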


There's not much good to either world.

You need fairly extensive tooling to make working with a repo of submodules comfortable at any scale. At large scale, that tooling can be simpler than the equivalent monorepo tooling, assuming that your individual repos remain "small" but also appropriately granular (not a given--organizing is hard, especially if you leave it to individual project teams). However, on the way to that scale, a monorepo requires no particular bespoke tooling at small or even medium size (it's just "a repo"), and the performance pain ramps up fairly smoothly from there. And those can be treated as technical problems if you don't want to approach social problems.

To put it another way, we're comparing asymptotic O(n) with something bigger, neglecting huge constant factors on the former. There's a lot of path-dependence, since restructuring all your repos with new tooling is hard to appreciate.


This is the approach taken by a number of projects. The ones I am most familiar with are the OpenEmbedded/Yocto/Angstrom family that build Linux for embedded devices. They have a 'root' repo that references the layer repos (using metadata files rather than submodules), and there is a tool that does the pulling. It's optimised for pulling not committing though, I don't think the tooling helps much with bumping versions.

It can be misused though - the releases of the root repository reference the children by tags usually. Someone retagged a child repo and we suddenly had build failures.


We actually did this. When I started at Uber ATG one of our devs made a submodule called `uber_monorepo` that was linked from the root of our git repo. In our repo's `.buckconfig` file we had access to everything that the mobile developers at Uber had access to by prefixing our targets with `//uber_monorepo/`

We did however run into the standard dependency resolution issue when you have any loosely coupled dependency. Updating our submodule usually required a 1-2 day effort because we were out of sync for a month or two.


Polyrepos are the way to go:

- Semantic versions.

- Group components into reusable packages.

- Don't use git submodules or other source cloning in builds; use native/platform package management (see the sketch after this list).

- Access control is made much easier.

- Sign commits and tags.

- Code review either before- or after-the-fact, just do it(tm).

- Reproducible builds - strip out timestamps/random tokens/unsorted metadata.

- Create CHANGELOGs semi/automatically.

- Eliminate manual steps altogether.

- Distributed builds/build caching (distcc, ccache).

- TDD smoke tests should run automatically in dev on save, within 10 seconds. Bonus points for running a personal TDD sandbox on faster remote servers via rsync, triggered on file save.

- Standardize on 1-3 languages.

- Services composed of simpler 12-factor microservices, not monorepo megaservices. Deploy fuse switching, proxying, HA/redundancy, rate limiting, monitoring and performance stats collection just like macroservices.
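For instance (a sketch; package and registry names invented), an internal component is published to your own npm registry and consumed through a semver range rather than cloned into the build:

    {
      "name": "acme-checkout-service",
      "dependencies": {
        "@acme/billing-client": "^2.3.0"
      }
    }

The component repo cuts releases with the usual npm version / npm publish (or the Maven/bundler equivalent); consumers pick up patch and minor releases automatically, while breaking changes require an explicit major bump.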


About half of these aren't specific to polyrepos.

Changelogs, reproducible builds, code review, signing, grouped components, distributed builds, TDD smoke tests, standardized languages, and microservices are all possible (and just as easy) in monorepos.

You no longer need to worry about versioning, which means no manual updates of either your own package version or your dependencies' versions. Although access control is more difficult, that doesn't seem like a good enough reason to make this kind of decision.


So the conclusion is "monorepo or polyrepo, you'll need a lot of tooling anyway. So why use monorepos?"

Very easy: because having everything in a single place is just easier to work with.


Easier because you can commit more atrocities.

Easier is not better, some things should be hard to do to dissuade you from doing them. Stop burdening your co-workers!

People in here keep saying "easy" like it's the end goal, but it's not. Correct is. Writing great software is hard, and monorepos make it even harder to do that because a monorepo encourages an "anything goes" vibe.


I don't understand how you can say that monorepos and correctness cannot coexist. Don't you have the minimum that is code quality analysis, automated testing and mandatory code reviews in place? Those must exist to maximize correctness, and they must exist whether you have 1 or 100 repos.

If I can commit more atrocities I can also commit more fine things, and I have infra in place to stop crap making it to master.

Like the article said: it's all about the tooling


I didn't say cannot, I said that correctness in a monorepo exists despite the monorepo, not because of the monorepo.


And conversely, polyrepo doesn't bring correctness just because it is poly.


It does promote it, however, due to the abundance of support, tooling, and how well it integrates with most processes and software available.

Being a special snowflake (monorepo) makes it a lot harder to write good software.


I have worked with polyrepo madness... I remember making commits to up to 5 different repos just for one feature. And to roll that feature out to prod, a few of those repos had to go through a release process. On top of everything, we couldn't really write tests to ensure the feature worked. The best we could do was write tests on the "user facing" repo and keep fixing and releasing the others until those passed.

Well, I am sure many companies are doing better than we did with polyrepos, though.


Monorepo = all applications in one repo

Polyrepo = each application in its own repo

Whatever-madness-you-had-repo = each application across multiple repos

I'm sorry to hear you had to suffer what you did, but that was not the only alternative to a monorepo!


Does a lot of the pain from a monorepo come from trying to use a tool - Git - that is explicitly designed to support distributed repositories? Wouldn't things be easier if you used eg. Subversion instead? That is a tool that was designed around a client/server paradigm and had a single repository as its main use case.


Anecdotally - I worked on a ~1 million lines of C project in SVN for a number of years (it was a Bluetooth stack). SVN handled it (mostly) fine.


Git was designed for a monorepo (the kernel). When people talk about monorepo, they mean a single history line.


The kernel is not a monorepo. The kernel is a large(ish) repo.

A monorepo would be if you put the kernel AND userspace in the same repo (e.g. all the code for a Yocto distro). To me, when people talk about a monorepo they are talking about putting separate pieces of the architecture in a single repo.

It's a great example actually. If it was all in a monorepo and you could release it together, then you wouldn't have to worry about breaking userspace; you could make the changes to both sides at once. In practice, because that's not how releasing works in that environment, you can't do that.


It is just the feeling I get when I see how it is used. That it is being misused.

If you have a corporate setting with a single Jira/Stash/Bamboo installation keeping all the company's stuff in a single central place, then this design where git copies (clones) the full repository onto the local machine just seems like a misfit and a source of lots of pain (i.e. just being slow and at times unstable, the local repository not pushed back to origin, multiple history lines, etc.).

It seems to me that things would have been better served with a more primitive client/server orientated tool like Subversion.


Can anyone here explain to me how a monorepo like Google or Facebook handles security?

If I pull the repo - I have the entire contents of Google or Facebook? Is that right?

Surely that lacks the normal security measures around what must be highly sensitive information, so there must be more to it than I know of?


(there's an acm paper about Google's repo that dives deeper into this).

First thing, you can't just "pull the whole repo" at Google or fb scale. It doesn't fit on a single hard drive.

This means the entire repo is normally accessed via networked means. As a result, builds can also be done over the network transparently.

So building and testing is done as a different user. That user can have different privileges than the individual requesting the build.

So there is a way to hide source code so that only the output artifacts (compiled binaries) can be accessed.

But I think the other part of that is that that's normally a tiny minority of code.

The other option is of course to live outside the monorepo, as some projects do.


Link to the paper "Why Google Stores Billions of Lines of Code in a Single Repository" https://cacm.acm.org/magazines/2016/7/204032-why-google-stor...

Also, "Software Engineering at Google" https://arxiv.org/pdf/1702.01715.pdf


(Worked at Google for 2-3 years, under the google3 depot) - AFAIK, only a few hundred files are not visible to employees, and certain folks (contractors) may be limited there too.

You can actually compile and run certain things in dev/staging, inspect code, click on a function and see call sites, even "debug" from the browser (debugging is pretty much: if your binary has stepped through a bookmark, it'll tell you, and you may print out locals from the scope - like vars, etc. - hard to explain in short).

Then "checking out", to put in p4/svn - is not really like that - but you can find pretty much videos, docs explaining it. It's more like - you create a "workspace", "client" (piper has history from p4, so some terms are similar to perforce), and in this client - you "view" the entire depot, and changes you've made are "overlayed". Then you can submit these in a CL (much like perforce).

There are also git (git5) and hg modes, but I've never used them. I've used the "CitC" one (client in the cloud), and it was great as I was able to edit, build, even deploy all from the browser (though I prefer a real IDE there - like Eclipse/IntelliJ/CLion).


FYI: Apparently "google3" is the Google monorepo

https://news.ycombinator.com/item?id=17123620


Sensitive information like user data or certificates doesn't live in a source repository. It lives in databases and systems for managing certificates.

Treating straight source code as sensitive information is security by obscurity.


> Treating straight source code as sensitive information is security by obscurity.

For many companies, their source code is a large part of their business.


You aren’t allowed to clone google’s repo to a laptop.


I wish this had touched on polyrepos' ability to pin known-good versions of dependencies; that tends to be the Achilles' heel of monorepos.


Yet who unpins them or updates when a new good version is available?


There's a cost for being out of date, but there's also a cost for learning the hard way whether a new version breaks prod. Pay it down like any other tech debt.

Maybe I could test literally every release version of each of my dependencies, but that isn't really my job.


Greenkeeper (and similar systems) comes to mind, too. You can still run CI against "the latest" in the polyrepo case. We have the technology to automate that, including situations like "let me know when the next version of my dependency that passes this test is released, and send me a PR to update my pinned version when it happens".


The person responsible for dependency management in your team. You have one right?


NPM might not be the best package manager, but if you're using something like Lerna you can get the best of both worlds. Your local copy of an internal dependency can either be symlinked to the local source code, or a copy of the published package.

That makes it a lot easier to work on a package and its consumers at the same time.
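For anyone who hasn't used it, the setup is roughly this (a sketch; names invented): a private root package.json declares the workspaces, and Yarn/Lerna symlink the local packages into node_modules so consumers resolve to your working copy instead of the published tarball.

    {
      "private": true,
      "workspaces": ["packages/*"],
      "devDependencies": {
        "lerna": "^3.0.0"
      }
    }

With packages/api-client and packages/web-app side by side, web-app depends on "@acme/api-client" by name and gets the symlinked source until you actually publish a new version.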


Oh God! I had forgotten how much frustration was in my old Python team until I checked the docs and discovered that you can make pip "install" your local copy of one repo as the dependency for another. The poor developer who prompted me to check was testing by doing CI builds and pulling down the new eggs.

Of course this isn't really a point against polyrepos, since it had a solution, but it's definitely something that I could imagine catching out lots of juniors.
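For anyone else who hadn't found it, it's the editable-install flag: a sibling checkout gets used in place of a published package (paths here are hypothetical).

    # instead of waiting for CI to publish a new egg/wheel:
    pip install -e ../shared-lib     # installs your local checkout; edits are picked up immediately

    # or, in a requirements-dev.txt:
    # -e ../shared-lib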


That can be good, or that can be bad.

Someone makes an incompatible change, but you do not find out until months later, because the client of that incompatibility is not using the latest versions. In the meantime, the development of the module and its downstreams has marched past any sort of easy resolution, and you are now essentially maintaining two different copies of your code.


Hmm. Monorepos strongly favor "always-good". Core library breakage gets detected and rolled back really fast. It's been months since I've been affected by one.

Known-good is in the eye of the beholder, and is just another dimension that generates breakage.


google3 breaking was a bimonthly event that left the entire team aground. Bad enough that it happened, but the worst part is that we couldn't do anything about it but wait.


I guess my end of google3 hasn't been like that in a long time.


Visited a customer recently who had inherited a monorepo.

All their CI and release problems traced back to it.

At the risk of sounding like an old git, package coupling and package cohesion principles were defined for a reason.

I do feel like a lot of patterns in contemporary development are kneejerk reactions to how last generation's programmers did things.

Exceptions? Nah, multiple returns! Dependency management? Who needs it... Oh, wait.

Many small, single-responsibility repos? Wang it all in one, and then invent your own tooling to cope with it!


>Exceptions? Nah, multiple returns! Dependency management? Who needs it... Oh, wait.

I thought the consensus was that exceptions, like OOP, are an antipattern. I guess there's room for different opinions. :-/


I would not assume there is _any_ consensus at all that object orientation (as defined by Kay, not by C++) is an anti-pattern.


There is a widely held opinion (that I share) that exceptions are a dated pattern that is better replaced by more modern alternatives.

They are still much better than the older patterns they were created to replace, like multiple return.


Go uses multiple returns. Rust uses Result<T, E>


Consensus could be a synonym here for 'current trends'.

I see no advantage of returning error types over checked exceptions. I'm happy to be informed, though.


You obviously haven't written any win32 applications :)

Where a function may return zero, non-zero, not-ERROR_SUCCESS, INVALID_HANDLE, NULL, (there's more), to indicate an error. You then need to call GetLastError which gives you a number that you input into google to get a generic error message.

You "should" also wrap every api call in this error checking so you know where the failure began.


> Wang it all in one, and then invent your own tooling to cope with it!

Why don’t you believe that additional tooling is required to manage software whose sources are composed from multiple repositories?


Thing is, that's been a solved problem in open source at least since Ivy.


Monorepos are way simpler for small teams to work with. At my startup we have roughly 10 services out of the same repo. It's much easier to "cut a release" across the entire system. It's much easier to share code internally, upgrade dependencies, etc.

For a larger company, it might not be a good idea. However, most startups start small and stay that way. Why take on the overhead you don't need?


I'm not familiar with how monorepos work in practice, but it seems obvious to me that it's going to complicate everyday tasks.

Ready to commit? Whoops, another team made a bunch of commits to their project, and you need to rebase your project before you can commit. (I'm having flashbacks to Clearcase already.)

Need to roll back the last two commits you made? Sure, that takes two seconds--oh, wait, another team made multiple commits that got interleaved with yours. Have fun cherry picking the files you want to revert.

Of course, I'm apparently a curmudgeon, because as soon as someone starts talking about running a find/replace globally across multiple projects, I want to grab something sharp.


See Hyrum Wright's talks on Large-Scale Changes at Google: https://www.youtube.com/watch?v=ZpvvmvITOrk (CppCon2014), https://www.youtube.com/watch?v=TrC6ROeV4GI (CppCon 2018), and paper by Googlers on ClangMR (http://www.hyrumwright.org/papers/icsm2013.pdf). It's great!

And if you'd like to learn more about how monorepos work in practice there's a couple of papers: https://cacm.acm.org/magazines/2016/7/204032-why-google-stor... and https://ai.google/research/pubs/pub47040

(Also worth reading: http://danluu.com/monorepo/)


> Whoops, another team made a bunch of commits to their project, and you need to rebase your project before you can commit.

If their commits didn't affect you, then … it merges cleanly, and you don't care. If their commits do affect you, then … now you know. With multiple repositories, you will have no idea that they broke something you rely on.

Even better, with a monorepo, if they break something your existing code relies on, they've broken the build, and they have to fix it. With multiple repos, another person or team is free to break things you rely on without even knowing it, and you won't know it until you update your dependencies hours, days, weeks or months later, wondering why everything is suddenly different.

If another team's work is tightly-enough interwoven with yours that their daily commits affect you, then you're all on the same team (and/or your architecture needs work).


> Ready to commit? Whoops, another team made a bunch of commits to their project, and you need to rebase your project before you can commit. (I'm having flashbacks to Clearcase already.)

If they merge cleanly, it's not an issue. If they don't, you need to fix the merge conflict. The work you need to do is proportional to the number of merge conflicts, which isn't special to monorepos.

> Need to roll back the last two commits you made? Sure, that takes two seconds--oh, wait, another team made multiple commits that got interleaved with yours. Have fun cherry picking the files you want to revert.

Again, only an issue if the changes are to the same files. It can be a bit of a pain to revert a stack of diffs, but if it's just a random commit with no other relevant commits to the file, very easy.


Yeah, the rebasing complaint isn't fair if you're using a modern VCS.

I used to work on a large team at Cisco Systems that used Clearcase. Clearcase does not do merges. If anything has changed in master, you have to check out again, which obliterates all local changes.

(I have never met a developer who liked Clearcase. It was built to simplify life for system administrators and to tick the right boxes for management, not to be useful for developers.)

My general VCS experience is that you can't roll back a commit without also rolling back all subsequent commits, related or not. I'm glad to hear that modern systems have fixed that. (It looks like even Subversion does that now, cool!)
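For the record, both let you back out a single change without disturbing the commits after it (the revision numbers here are just examples):

    # Git: create a new commit that undoes just the one bad commit
    git revert <sha-of-bad-commit>

    # Subversion: reverse-merge a single revision into your working copy, then commit
    svn merge -c -1234 .
    svn commit -m "Back out r1234"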


Why would you need to rebase or cherrypick unless you and the other team were touching the same files?


Problem is, places that use monorepos also tend to have whole teams full of people who feel entitled to f#ck with stuff across the entire repo, often without going through the normal review or other processes for each component. Thus it's not uncommon to commit code early in the day so that you can build packages for system test, find a problem in system test, then come back later the same day to find one of those randos has already "fixed" your code to adhere to some new standard they decided on at lunchtime. Now reverting only your own commit is a mess, and reverting theirs as well invites a s###storm of epic proportions because they're higher-caste than you.


> places that use monorepos also tend to have whole teams full of people who feel entitled to f#ck with stuff across the entire repo

That's only true to the extent that the statement 'places that use multirepos also tend to be full of people who feel entitled to f#ck with stuff across the entire codebase' is.

Bad colleagues can cause damage either way. I appreciate that with a single repo, bad colleagues are at least forced to have passing tests after their changes, rather than leaving it to me to pick up the pieces.


That hasn't been my experience. Yes, it's a culture thing rather than a technology thing, but with a monorepo the "core" or "foundation" or "developer experience" teams tend to act like they're the owners of all the code and everyone else is just visiting. With multiple repos that's reversed. Each repo has its owner, and the broad-mandate teams are at least aware of their visitor status. That cultural difference has practical consequences, which IMO favor separate repos. The busybodies and style pedants can go jump in a lava lake.

> with a single repo, bad colleagues are at least forced to have passing tests

Passing unit tests, big whoop. Maybe sometimes light functional tests. Integration/system/stress tests that have to run across an entire cluster for non-trivial time to get meaningful results and thus can't easily be kicked off from a commit hook? Not a chance. The coverage that results is no better (or even different) than what you'd get with separate repos.


> with a monorepo the "core" or "foundation" or "developer experience" teams tend to act like they're the owners of all the code and everyone else is just visiting.

Fwiw, this hasn't at all been my experience with this kind of thing at Google. Certainly developer experience teams and language teams will make broad changes that affect everyone, but:

Those changes are trivial: the change maker has to make a convincing argument that the change can't break anyone; if they can't, it will require local approval from the owner.

They don't happen all that often: ~once a month for any given leaf directory.

They require a special approvals process where you have to answer why the change is necessary, why the churn is worth it, and why it really is safe, and convince a group of approvers that this is the case.

If they do break something, the change-maker has to roll it back and then either reconsider, or fix the issues and try again.

>The busybodies and style pedants can go jump in a lava lake.

Consistent style is important. Among other things, it means that unmaintained code (i.e. the stuff written 3 years ago that does its job) still gets updates to be consistent with everything else, so that you don't end up with ancient spaghetti that breaks every modern style rule (I mean you still get that, but less). It also allows deprecation. If there's a core library that does something "wrong", they can deprecate the badly behaved stuff and make sure everyone is off of it: if it's not in trunk, it isn't being used.

>The coverage that results is no better (or even different) than what you'd get with separate repos.

This is untrue. If I'm the `core` team who maintains the `core` repo that everyone depends on, and I make a change that breaks you but my tests pass, you don't know it. Then when you version bump, fixing yourself is your problem. With a monorepo, it's my responsibility to fix you before I can make my change.


> Consistent style is important

Many things are important. Some things are more important than others, and I'd say that "not making a thousand developers' workflows more cumbersome" is higher on the list than style issues.

> If ... I make a change that breaks you but my tests pass, you don't know it.

So don't do that. Open-source projects deal with this exact same issue across repos and owners all the time. There are responsible ways to do it. Mostly they involve learning to communicate as peers, with respect, instead of "core" teams imposing their neophile opinions on everyone else. If we're all at the same company, regardless of whether we use one repo or many, there's no excuse for you not to validate your changes against other groups' tests as well as your own. Diligence does not depend on a particular repo structure.


Elsewhere in this thread I've seen just the opposite. Tons of people claiming variants of "breaking changes should just bump the major version."

I'd argue that in the long run, not being able to update dependencies because they broke you is going to be much worse than them fixing the incompatibilities for you.

Either way, you need people to act like adults and communicate, but the multirepo problem is worse.


If you don't touch the same files the merge is trivial.

If you do - then it's a good thing that you see there's a conflict right away, rather than notice versioning problems between your separate subsystems in integration testing (or even worse - production).


Why would those things be issues? At Google there are multiple new commits per second and it works mostly fine; you just need a VCS which is made for a monorepo.


You're not a curmudgeon, you're just wrong. As another comment states, you'd have to fix the merge conflict anyways and if there's no merge conflict, your rebase isn't an issue. If you're getting merge conflicts and need to rebase your project and that's challenging, it turns out to be an...organizational issue.


It seems Facebook built tooling to address that issue. I doubt they would tolerate that lifestyle either:

https://code.fb.com/web/rapid-release-at-massive-scale/


A lot of the pain you describe boils down to insufficient tooling and bad code organisation.

Outside of mega-corps, usually only a few people (couple of teams at best) are working on a given section of code at a time. Coordinating changes between maybe 12-15 people is quite feasible. Most of the time it's enough to keep code nicely segregated by paths - something like $team/$XXX or $scope/$team/$YYY should work.

On top of that, you need two things to enable a nice workflow:

* server-side (or otherwise programmatic) merges/rebases only; no human should ever need to push to master directly. That's the job of the [pre-]CI machinery.

* comprehensive pre-merge testing before the server-side merge. Do all your development in branches, and have bot+CI test _all_ unmerged branches against MASTER+YOURBRANCH on every push to YOURBRANCH. Because no human is involved when merging to master, the tip of the master has not moved due to external factors. Also, rerun tests if master has indeed moved thanks to bot having merged another branch ahead of yours. Fix any test breakages in the branch.

To make the second item work, you have to realise that the testing steps need to be rapid enough. Usually it should be enough to test the to-be-merged branch for merge problems and code convention errors, and have a run through all the usual unit tests. Most of the time you can leave any larger cross-service or integration/end-to-end tests for code that has landed in master already.[ß]
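Concretely, what the bot does per push is something like this (pseudo-shell; run_lint and run_unit_tests stand in for whatever your CI actually invokes):

    git fetch origin master
    git checkout -B merge-candidate origin/master
    git merge --no-ff your-branch   || { echo "conflict: fix it in the branch"; exit 1; }
    run_lint && run_unit_tests      || { echo "branch breaks current master"; exit 1; }
    # only the bot fast-forwards master to merge-candidate once this passes;
    # if master moved in the meantime (another branch landed), re-queue and repeat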

Once code has landed in master, CI can pick it up and produce the release artifacts. The tooling needs to be good enough to know how to avoid useless work (doing useless checkouts and running tests against dozens of branches will get expensive). It also needs to provide very good and easily actionable feedback. You want clear test results, with quick jumps to failure(s) and robust logging.

To give some context of where the above is coming from... We have a monorepo with more than 130 projects, and about 7% of the codebase changes in any given month. (Except for December. Understandably.) We also clock more than 40k pre-merge test runs a month. Once a merge request has been approved, it is often available for shipping in 20 minutes.

When builds against master never fail due to code conflicts, development velocity is maintained better even with multiple teams working on the same piece of code.

ß: this is a tradeoff between development friction and test coverage. A sufficiently thorough integration test can take anything from a just a few minutes to couple of hours. You want to run them against batches of changes, and in case of introduced failures, incentivise teams to put in new unit tests to cover as much of the uncovered error scope as possible.


The article is a great summary of the pros and cons.

What is still missing from the default tooling is a way to make a change across repos.

At GitLab we're working on group merge requests to solve this https://gitlab.com/gitlab-org/gitlab-ee/issues/3427


Unless you are pure OSS or pure closed source - you end up with a poly-repo strategy regardless as you split open and closed code, suffering the annoyances of both systems.


No, what you end up with is a system for mirroring open source code into your repo, and a system for mirroring commits that should be open source from your code into external repos. All active work still happens in a monorepo.


Any good examples of small companies that pull this off and the bots/CI/tool whatever they use to do it?


The truth is that you're not going to get to make this decision. If you're starting greenfield, you're going to start a single repo for your project. If that greenfield is the whole company and everything is part of that project, you get a giant monorepo. If greenfield is a new division that's not part of another project, you're going to create a new repo, and now you're in a polyrepo environment.

Which way it goes is determined by the environment, wherein the engineers do the sensible thing at the time. Then you do the engineering to solve the problems with whatever way you went.


At Mozilla, we started with a monorepo, then went to a mostly-monorepo, then consolidated just about everything back into a monorepo, and have since "decayed" a bit to a mostly-monorepo.

There were major gains from merging our source tree with the continuous integration support code and configuration. We've pretty much always vendored selected third-party code, so that didn't really change. Large collections of tests and their infrastructure have been much easier to manage as part of a monorepo.

Given that our tooling efforts are mostly in handling a monorepo, I can't really judge how differently things would be if we had gone full multirepo -- my experience with multirepos has been pretty awful, but that's an unfair comparison since we intentionally haven't worked on tooling for it. We solved our worst multirepo problems by going monorepo, but I'm sure other projects have solved their worst monorepo problems by going multirepo, so neither really proves anything.

The push for separate repos these days mostly comes from social reasons -- if you have a separable piece and want external contributors, that's a strong motivator for putting it on github since that's just Where Things Are these days. No matter how good your tooling and issue tracking and whatever is, even if it's far superior to github's, it doesn't really matter. People have to learn yours; they already know github's. I don't particularly like github's workflow, but I still use it.


I've started a greenfield project that involved multiple modules, and I created a different repo for each, but I could just as easily have created a single repo.


The biggest advantage monorepos have offered is the development of tools like Lerna (1) or Yarn workspaces.

Before that, there used to be a node_modules folder with GBs of [useless] data in all my projects. Now there is just one folder at the top and that's it. Also, if you're developing lots of modules or plugins, it makes it super easy to work without committing changes since they are symlinked.

(1) https://lernajs.io


node_modules is an interesting worst case of package management systems.

There's some good exploratory work currently happening on making node_modules and the node package ecosystem better in general, but especially in the polyrepo case. Yarn "Plug'n'Play" is one, and Tink [1] the other.

[1] https://npm.community/t/tink-faq-a-package-unwinder-for-java...


>Scaling a single VCS to hundreds of developers, hundreds of millions lines of code...

Maybe I am way out of my element here, but is this a common problem? Do companies with only “hundreds of engineers” really have “hundreds of millions of lines of code”?


From personal experience, it can happen. At one point, I was personally responsible for about 2 million lines of code. Over several years, I was able to reduce it to about 500k through generous use of code generation for ORM-type work. The generated code never ended up in VCS, but the generator and model did. It certainly helped checkout/update times, as there were several thousand fewer files to deal with.

I was one of about 900 engineers at a financial company of about 1500 employees at the time.

I don't honestly know how many lines of code there were across the company, but I imagine it easily exceeded 100M. It took us a full week to do a full recompile of everything. We had no CI... Was always a problem approaching release time.


> I was one of about 900 engineers at a financial company of about 1500 employees at the time.

Can you say which company it is, or give a few more details? I'm fascinated to know which financial company can consist of 60% engineers!


Sorry, no, NDA and what not. But it was a very technology-driven hedge fund, not one of the big banks you read about in the news.


> is there any real difference between checking out a portion of the tree via a VFS or checking out multiple repositories? There is no difference.

How big is your monorepo? Assume each line of code is a full 80 characters, stored as ASCII/UTF-8. That's 67 million lines of code in roughly 5GB (67M lines × ~80 bytes ≈ 5.4GB). I can fit five of those on a Blu-ray.

> The end result is that the realities of build/deploy management at scale are largely identical whether using a monorepo or polyrepo.

True.

> It might be deployed over a period of hours, days, or months. Thus, modern developers must think about backwards compatibility in the wild.

Depends entire on the application. Lots of changes are deployed within short periods of time with low compatibility requirements.

> Downside 1: Tight coupling

Monorepos do often have tightly coupled software. Polyrepos also often have tightly coupled software. Polyrepos look more decoupled, but pragmatically I can't say I've noticed much of a difference.

> Downside 2: VCS scalability

I've also heard Twitter engineers complain about the VCS. But what is the scope of the author's discussion? 1,000 engineer orgs? Or 20 engineer orgs? Those are vastly different levels of engineering collaboration. I assume the article was not written to cover both of those. Or was it?

---

Ultimately, I think the author implicitly assumed a universe of discourse of gigantic repos with hundreds and hundreds of daily contributors.

When people talk about the spectrum of monorepo vs polyrepo architectures, that is the extreme end of it. For example, last I knew, Uber had more repos than it did engineers. And I don't assume that "polyrepos" always means multiple repos per engineer.


No silver bullet here, I think.

It's definitely the case that a mega monorepo doesn't, in practice, have the atomic commit property. E.g. once you add owner files and separate code reviews, you're in for a world of hurt. Case in point, Google developed an internal tool to split cross-cutting CLs into manageable pieces, wrangle all the owners and approvals, presubmits, etc, and then submit the CL piecemeal--i.e. not atomic.

Chromium uses a different model. It just DEPS's in other repos at pinned versions. That has a whole other set of problems.
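For those who haven't seen it, the DEPS approach is essentially a top-level file of pinned revisions that gclient checks out (a rough sketch; the path and hash are invented):

    # DEPS — evaluated by gclient
    deps = {
      'src/third_party/somelib':
        'https://chromium.googlesource.com/external/somelib.git' + '@' + 'a1b2c3d4e5f6',
    }

    # `gclient sync` then checks out each dependency at exactly its pinned revision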


(Disclaimer: at Google)

It's not quite so black and white. It's true that repo-wide refactorings often get carved into little changes and so aren't made atomically, but they're the exception rather than the rule. Any small change, e.g. changing an interface and the 5 callers of it, _can_ be made atomically. And changing code that's reused a small number of times is a far more common case than changing core libraries the whole company uses, so atomic submit ends up being hugely valuable.


I've been in a project where some (authoritative) people had a tendency to split things into separate repositories for very small things, e.g. repositories with a single class. This was pure developer hell. Any change meant changing at least 3 repositories, including a review for each change. Never understood this decision as all parts needed to be on the latest version anyway. Caused lots of dependency and versioning issues too.


You know what's worse than a monorepo? A duorepo. Yes, that's right, two huge repositories embodying all the problems of a monorepo, but coupled in such a way that it's easy to break something if the commits and deployments from one are out of sync with the other. It's like drinking both bottles of poison, yet it (or minor variations such as three or four entangled ginormorepos) is a thing that really exists.


Alternate title: monorepos - ideal for teams under 100 devs


At Uber, both of our iOS and Android teams are over 100 contributors each and we have a Monorepo for each app platform. I'm not on the ops team but being in a Monorepo here has been one of the best development experiences in my career.


FB has a monorepo for most of the known universe


And Google has one that includes the multiverse.


> we have a Monorepo for each app platform

Do you mean two separate mono-repos, one for Android, one for iOS? To me that's not a monorepo. Is there little shared code between the two platforms, or is there a third repo that is depended on?


Yeah, it's two separate monorepos. There is actually some shared code between the multiple monorepos. Code like IDLs and some C mapping code is shared, but they are referenced as vendor libraries, so each monorepo updates those dependencies when it needs to. If you think about Android vs iOS vs a web-based dispatch system vs an autonomy system, they are all totally different. All of the code, dependencies, vendor code (external dependencies), and tools in each monorepo are "platform" specific and live in each respective monorepo.

Each monorepo is solving one cognitive problem domain with all of the libraries and dependencies that go with it usually for a target platform. Every iOS app that Uber builds lives in one repository — you clone that down and you can build Rider, Driver, Eats, Freight, and Jump. They all use the same networking stack, same UI stack, same map stack, same VIPER architecture, etc. This makes it easy for someone who is an iOS dev to work in any app or spin up a new app with almost no effort.

But, you're right... it's not a monorepo in the sense that all of the code in the company is in one place. Maybe I like micro-monorepos.


That's okay. I like micro-macroservices.


The title should have been "Monorepos: Please don't, at scale", but I suspect it was made intentionally controversial.


Which large companies use polyrepos? Google, Facebook, Microsoft, Uber, all use monorepos.


Microsoft doesn't use a monorepo. Bloomberg, Dropbox, and Amazon don't either.


Microsoft switched all of Windows to a monorepo.


I think Windows was already a single repo forever. IMHO, that's exactly what is slowing it down right now.

Other products have their own repos, sometimes multiple. As of 2016.



Amazon


If it's not at scale, is it even a monorepo? It's just… repo??


At the bank I used to work for, we had a monorepo with around 2500 developers committing in an average month. Total codebase was 35M LOC.

I gave a talk on it, "Python at Massive Scale", at PyData London 2018. Videos on YouTube/PyVideo.


It’s easier to divide modules than put them back together. Wait until you know what the real boundaries are (they tend to change as scope and functionality expand)


Open source software workflows are very common and provide a _lot_ of tooling, e.g., Maven, bundler, npm, etc. Add semantic versioning and you have a lot of tooling that you basically get for free in polyrepo setups. With monorepos, you have to spend a lot of time on tooling, because you basically don't get to use the OSS tools.

There's a lot of odd arguments in this blog that are very spurious:

"If an organization wishes to create or easily consume OSS, using a polyrepo is required."

What? _Consuming_ OSS is usually not that bad. I've even imported the complete history from external repos pretty easily. (It does suck with git, but I wouldn't use git for a monorepo...) _Contributing_ to OSS is tricky, but the fact that you use polyrepos doesn't really help you much there either.

"Polyrepo code layout offers clear team/project/abstraction/ownership boundaries and encourages developers to think carefully about contracts."

Clear ownership boundaries have _zero_ to do with polyrepos. In fact, I'd say monorepos can be easier, since you say "everything under this directory is owned by X, Y, Z". There's no search function required to figure out where some other team hid their code. So many times, with polyrepos, projects are _hidden_ because they're off in some other grouping unit that you're not a member of, so you don't even know who owns what or where it came from.

In the end, I'd still strongly recommend using polyrepos because you get _a lot_ of tooling for free, and most integration issues are solved with semantic version locking and CD automation. But the arguments here are not really great.


I suspect the problem most people end up trying to solve isn't "how do I technically scale my tools", because the author points out that tools and techniques for this already exist and its an already solved problem.

Instead, my experience has largely been that the problem to solve is "how do I make a few hundred developers behave in a predictable way". You have many OK developers, but you can't really be sure that none of them will break stuff. You're trying to solve the organisational problem of restricting merge rights to the people who won't break things, while at the same time not bottlenecking development on too small a number of people. In that case, sure, split your repos up so that people can only break stuff that they 'own'.

But at least be honest about the fact that most of the technical issues of having a monorepo have been solved already, so the issues you are probably trying to solve are actually people problems.


The VCS/codebase-tooling-size argument rings a bit hollow.

We have really good code-search tools that are heavily optimized and indexed (from ripgrep/silversearcher to more centralized things like hound, when local-disk performance just won't cut it).

It's not hard to optimize Git workflows to be faster with relatively simple tricks, and if that absolutely doesn't scale for some reason and VFS isn't an option, there are always centralized VCS systems like Perforce that solve this. P4 gets a lot of shit, but it's really good at solving the gigantic-repo domain; tune your client properly and you can initial-sync 10+ GB repos in the time it takes to get a cup of coffee (and, if your company is large/old enough to have a repo that big, it can probably afford the Perforce licenses).


I feel like a lot of arguments against monorepos assume micro services are the _only_ option on the other side.

I tend to break up most of my projects at the edge of business logic or domain logic and lean on a package manager to “deploy together” like any other dependency that’s not in your repo.

This allows teams to work independently without a large sprawling repo. If you’re following anything semver-ish hopefully your other teams in the company aren’t breaking releases and you can auto-upgrade on patch level changes. If not, we’ll thank goodness CI is there.

I’ve always had difficulty working in projects with too many purposes. This helps me focus where to put things and gives an easy point of escape if a dependency needs to become a service in the future.


It's worth pointing out that while Google has a monorepo in google3, it also doesn't at the same time. There are other projects such as Android, ones based on Chrome, etc. that are composed of multiple git projects and use repo to manage and sync them.


This is just modularity in the broader handling of information-related development. People tend to get political about things that make their involvement comfortable, and/or when it's not them who have to deal with the pesky consequences. I strongly suspect that must have been the cause of how giants like Google¹ or Facebook² ended up with monorepos. It's either making developers put in the work or letting them have their cake: kick the problem down the road and hope to acquire, in time, the resources necessary to throw at it later.

¹² "Don't be evil" (with developers, among others) and "move fast and break things" most definitely asks for cutting a few corners here and there.


Since I learned that Google did it, I have started to think about it a lot.

And mono-repos really do make sense (a lot) when you need things tied together. Finding errors in your console immediately, without version numbers, gets the job done faster.

There are other ways though, for example if you use .NET: a monorepo that creates NuGet packages, and projects that pull the latest build of them into their solution. This way, external parties can re-use the same components.

On a beta version, that releases new nuget components, if there is a file change ( and so a version update), notify the external parties.

Have one website which mentions the schedule of an update on the live version to reduce email traffic. Oldskool, but it seems to work.


The real issue is that this is more nuanced than is appropriate for one-size-fits-all advice. "Everyone should use a monorepo" isn't helpful, but neither is "everyone should not use a monorepo."

Sadly, this article falls into the same trap.


As always, the real world is a whole lot of gray between the black and white articles that are fun but useless.

Multirepos like microservices are all about scaling people, not the project. Start with the monolith and monorepo until you need to split, and then focus on separating by groups of logical functionality or team responsibilities (although if those are different then you'll end up with other problems).

Also stop taking things literally. Monorepo does not mean you must have everything in a single repo. Even a startup can put the majority of the codebase in one place and have things like a corporate website or small admin backend in another.


I am sure this author means well but I respectfully disagree with this advice.

The author is arguing against the monorepo approach and then proceeds to list out some of the most successful software companies on earth as reasons NOT to do it. The reason they were able to get to their lofty heights was in some part because they used a monorepo. The biggest advantage of a monorepo is you can move quickly and understand the implications of changes since everything is housed under one roof. That's critical for startups IMO. By the time you reach the "scale" the author is talking about, you have the resources to deal with it. Is it hard? Yes. Is it worth throwing out the baby with the bathwater? No IMO.

I currently work in the polyrepo world that the author is encouraging. I can tell you it f*cking sucks. Just take the very simple example of firing up your dev environment. In a polyrepo world, you have to individually fire up each codebase or write some sort of script to do that for you. The former sucks for obvious reasons, and the latter makes the case for a monorepo, since one dev could author a script that could then be used by all (since he/she will know the paths to all the things that need to start). Don't even get me started on setting up an environment from scratch. Containers make this easier but again, it would be nice to just rock `./start.sh` and be off to the races. A monorepo can give you that.
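
For what it's worth, a minimal sketch of what that shared script could look like in a monorepo, where the paths are known to everyone (the service names, directories, and commands here are made up for illustration):

    #!/usr/bin/env python3
    # start.py -- hypothetical monorepo dev-environment launcher.
    # Service names, paths, and commands are placeholders.
    import subprocess

    SERVICES = {
        "api":      ("services/api",      ["npm", "start"]),
        "worker":   ("services/worker",   ["python", "worker.py"]),
        "frontend": ("services/frontend", ["npm", "run", "dev"]),
    }

    procs = []
    for name, (path, cmd) in SERVICES.items():
        print(f"starting {name} in {path} ...")
        procs.append(subprocess.Popen(cmd, cwd=path))

    # Block until everything exits; Ctrl-C takes the whole group down.
    for p in procs:
        p.wait()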

Pulling/pushing changes to your VCS becomes a tiresome, error-prone nightmare, since now you need to remember to run git pull on all the codebases that touch the area you are working on. You might forget to pull on one of those codebases and everything starts breaking, and now you need to stop and track it down. Dumb error? Yep. Not a thing in a monorepo? Yep. PRs become really sucky because now you need to harass your team for n PRs instead of just the one if the feature you are working on cuts across codebases. I've worked in some fairly large monorepo codebases with lifespans of >10 years and I can tell you that I have yet to encounter any of the VCS scaling issues the author speaks of. If in the future I find myself in a situation like that, you know what I'll do? Migrate to a more performant solution like Mercurial or something. Will it suck? Sure. But not as much as dealing with a polyrepo.
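
To make the chore concrete, here is roughly the helper you end up writing (and forgetting to keep up to date) in a polyrepo setup; the repo list is a made-up example and assumes sibling checkouts:

    #!/usr/bin/env python3
    # pull_all.py -- hypothetical helper to sync every polyrepo checkout.
    # The repo names are placeholders; keeping this list current is
    # itself part of the problem.
    import subprocess

    REPOS = ["frontend", "api", "billing", "shared-libs", "infra"]

    for repo in REPOS:
        print(f"--- {repo}")
        # Assumes each repo is checked out as a sibling directory.
        subprocess.run(["git", "-C", repo, "pull", "--ff-only"], check=True)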

Then there's dependency management. Holy sweet mother of god, dependency management is the worst. Let's say you need to make a breaking change to one of your codebases: in a monorepo (with decent test coverage or a type system worth a damn) you have a decent chance of tracking down everything that needs to get patched. In a polyrepo? Phttt! Enjoy those bug reports from your customers and 1am hotfixes bruh.

I really, really wish people on here would stop trying to solve problems of "scale" when that's literally the last thing you need to worry about. Being able to respond quickly to business requirements is the only thing you should be worrying about until it's obvious that you've made it. Then feel free to worry about scale.


Yup... This. You saved me 20 minutes of writing this same/similar reply.

Tragically, this article will subsequently be cited by countless software managers who retain a fear of monorepos for some of the reasons cited here; junior (and senior) devs will go along with it out of not wanting to stick their necks out, and the cycle of pain will continue.


If you spend your time maintaining and improving a low level library then having a monorepo is much better than having a polyrepo. Primarily because as you make changes to the library you can update everyone else's code that uses it rather than having to wait on them to do so. This reduces the need to maintain older versions of these libraries and applications.

Additionally, you can submit the single change in one go, which updates everyone, and it is much cleaner than having to track down every repo that could be using your library and manually submit a change to each one of them, probably breaking some for a time in the process.
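
As a rough illustration (not anyone's official tooling), this is the kind of one-off codemod you can run at the monorepo root and land together with the library change; the function names are hypothetical:

    #!/usr/bin/env python3
    # rename_api.py -- hedged sketch of a repo-wide rename committed in one go.
    import pathlib
    import subprocess

    OLD, NEW = "parse_config", "load_config"  # made-up API names

    # git grep -l lists every tracked file that still calls the old API.
    result = subprocess.run(["git", "grep", "-l", OLD],
                            capture_output=True, text=True)
    for name in result.stdout.splitlines():
        path = pathlib.Path(name)
        path.write_text(path.read_text().replace(OLD, NEW))

    # Review the diff, run the affected tests, then submit as a single change.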


"Yeah, well, that's just, like, your opinion, man"

Monorepos are one way of solving some of the problems each organization has. Monorepos require discipline in solving those problems and if the organization is not willing to get there all the way or if it takes too much time then it's just pain and suffering for everybody.

I suspect the author works for one of those organizations that wanted to be hip but did not actually understand what it entails. Maybe faking agile and devops sort of works for you (works as in "it's difficult to pinpoint where the problem is") but faking monorepos certainly does not.


With the right tooling for both types, each directory in a monorepo is equivalent to a repository in a multirepo setup. The only difference is that in the monorepo it is easier to create new repos and dependencies between repos (just add a directory in a commit, or add a dependency on another directory).

The author of this piece apparently thinks that the ease of working in a monorepo is a bad thing; I disagree. I think that being able to treat repositories as easily as directories is awesome, since it is a lot simpler and requires a lot less training for your devs to understand.


I agree, but also...

Medium: Please don't


Agree 100%. "Pardon the interruption" followed by an article with a fixed top bar asking me to "become a member" (supporting an anti-open web tech company like it's a charity) and a fixed bottom bar asking me to sign up.


Why?


TL;DR: Really bad user experience.

For a longer explanation see https://medium.com/@nikitonsky/medium-is-a-poor-choice-for-b...

Alternatives better than medium: Wordpress, Blogger, github pages, plain html files.


“For a longer explanation: see this Medium post.”

There is a certain irony there which betrays one of the problems left unaddressed.


The purpose of it being a Medium post is so you can instantly see what he's talking about.


Static analysis is easier on a monorepo; at least you can run it on all the code. A polyrepo has the problem that some code is off the radar. That might be the only advantage of a monorepo, in my opinion.


One thing I really dislike about monorepos for node modules is that you can't npm install from them directly. Unless it's a project with very fast PR merging and releases, you can be stuck with a broken module that has an open or even merged PR to fix it that you can't install because it hasn't been pushed to npm. npm link might work locally, but that doesn't help if it needs to build on a CI server. If it's one repo per module then I can just npm install the git URL and it works fine.


Me, I dream of a monorepo covering the whole world. Give me a single hash, and let me know the state of things as they are, reaching from the toolchain used to compile the bootloader to the state of the database, which has just dropped a row and therefore generated a new commit, forever secure, an immutable history.

I accept the infeasibility of my dream. But I'd like my repo to cover as much as my tooling realistically allows.


"If an individual clone got too far behind, it took hours to catch up (for a time there was even a practice of shipping hard drives to remote employees with a recent clone to start out with). I bring this up not specifically to make fun of Twitter engineering, but to illustrate how hard this problem is."

But mostly to make fun of Twitter engineering.

Seriously, what advantages would a big bag of billions of lines of code have?


Just read the article and it was a really great read. We're using polyrepos and dealing with so many repos has not been great. That's why I created "gitbatch". Gitbatch lets you manage multiple repositories in an easy way.

https://github.com/isacikgoz/gitbatch


It's funny you say this, because my most viewed article from organic searches is about converting your polyrepo setup to a monorepo https://www.jvt.me/posts/2018/06/01/git-subtree-monorepo/


Google uses a monorepo: https://cacm.acm.org/magazines/2016/7/204032-why-google-stor...

I don't think "scale" gets much bigger than Google.


TL;DR

The post just reads like an opinionated piece written for traffic. The author has never even used a monorepo as far as I can tell, so can only argue from one side, the only one they've ever used: polyrepo. The post then goes on to list 'theoretical' benefits and downsides of monorepos (which should also be theoretical, having never been used). It concludes with "The two solutions end up looking identical to the developer. In the face of this, why use a monorepo in the first place? Please don't!", implying that 'Google, Facebook, Twitter, and others' do it for no benefit.


You can make shallow clones, and it's common to auto-push to a single repo; for example, auto-pushing to GitHub from an internal repo. It sure has its issues, but problems are solved with solutions. Simply burying your head in the sand, e.g. using separate repos where a monorepo is the best solution, is not a solution.


I didn't actually know somebody already does this, but a conceptual idea came into my mind yesterday: what if there was just one big code repository for a particular programming language and everything anybody writes would immediately become a part of the standard library? It feels kind of like a collective brain...


To sum up the discussion of why this is either absolutely right / absolutely wrong, how about: "mono- vs. poly-repo: neither is THE single solution for every use case in every organization and project"? Besides that, sure, let's keep analyzing the pros and cons of each in different scenarios...


This argument boils down to people who have used Perforce, who believe in the benefits of a monorepo, and people who have only ever used git, who do not. While it's true that git is a terrible program, that does not lead to conclusions about the merits of a monorepo.


Hm, I find monorepos a natural fit in javascript land. There is a lot less wiring, afforded by a little meta-orchestration. This has been especially helpful in the repos I've worked on.

But from reading the article, it seems like there are legitimate areas where they might not fit.


One glaring omission of the monorepo design (not sure why, really) is that if you want open and closed source software in the same monorepo, it doesn't seem possible. Curious as to why this design choice was made.


It's entirely possible, and repositories like https://github.com/facebook/fbthrift/ are an example of an open source project that is synced commit for commit with a private monorepo.

It just requires some tooling (like everything with monorepos)


Thanks for sharing


The word "workflow" is suspiciously absent from the OP.

Annoying workflow is my #1 complaint against polyrepos.


In my opinion, several build systems / package managers have already solved this issue. The answer is that it doesn't matter, monorepo vs. polyrepo. Look at nixpkgs/nixos/nixpkgs if you are interested.


I'm using a monorepo as a solo developer, and it's been pretty good. I like having everything in one place, so I can work on everything in a branch, including the feature, updates to API clients, documentation, blog post, etc.

One problem is that my test suite is very inefficient. I have to run through every integration test, even if I haven't changed any code that might cause these tests to fail. It's especially weird that CI runs all my tests whenever I write a new blog post. So I'm very tempted to split up some things into internal libraries and keep them in a separate repo, and add all these repos as submodules. I know this can be pretty dangerous, and it's easy to break things when you update dependencies, OS versions, language versions, etc.

If I go down this road, I have to be extremely careful to enumerate all the things that might break the library, and prevent any of these things from being updated automatically. I'll set a very strict version constraint in the package.json / gemspec, and throw an error if I detect a different version of Node, Python, Ruby, system libraries, etc. Then I'm forced to run all the library tests and explicitly bump the versions if I want to update anything.

I should also only do this when the library is a pure function with no side effects.

The really tricky part is figuring out how to write robust integration tests. API boundaries can be a big source of bugs. I think I'll do something similar to VCR [1], where the first integration test executes all of the code without any mocks, and then records the response. The response would then include those exact arguments, and it would also be tied to a specific commit hash for the library. If I change anything in the library, then I just need to re-run the slow tests, and then everything will be cached. I guess a real advantage of putting things in a separate library is that you know exactly what files are required for a specific feature, and the commit hash gives you a "fingerprint" of those files that you can use for caching in your tests.
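
In case it's useful, here's a minimal sketch of that fingerprint idea using git itself: `git rev-parse HEAD:<dir>` gives the tree hash of a directory, so the slow suite only needs to run when that hash changes. The library path, stamp file, and test command are assumptions:

    #!/usr/bin/env python3
    # slow_tests_if_changed.py -- hypothetical gate around the slow suite.
    import pathlib
    import subprocess

    LIB_DIR = "lib/api_client"              # made-up library path
    STAMP = pathlib.Path(".slow_tests_ok")  # last fingerprint that passed

    # Tree hash of the library directory at HEAD, i.e. a fingerprint of
    # exactly the files the library consists of.
    fingerprint = subprocess.run(
        ["git", "rev-parse", f"HEAD:{LIB_DIR}"],
        capture_output=True, text=True, check=True).stdout.strip()

    if STAMP.exists() and STAMP.read_text() == fingerprint:
        print("library unchanged, skipping slow integration tests")
    else:
        subprocess.run(["pytest", "tests/integration"], check=True)
        STAMP.write_text(fingerprint)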

Just have to be super careful about any dependencies that might break the library. Also I really need to start running all my tests in a Docker container which matches CI and production. I even have some screenshot tests where I have alternative versions for Mac and Linux. Would be nice to delete those. The experience was really bad when I tried to do this in the past, so I need to figure out a better way.

Anyway, sorry for the train of thought! Would be interested to hear your thoughts, and if there's anything else I should watch out for.

[1] https://github.com/vcr/vcr


the grass is always greener.


These arguments are weak, IMO.

Yes, monorepos can be slow to browse through if the VCS isn't configured to handle the size (sparse pulls aren't the default with Git; that alone can make a massive difference when your repo is massive). Polyrepos can be just as slow, however; what's worse is that there are more of them.
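
For reference, the non-default setup I mean looks roughly like this: a blobless partial clone plus a sparse checkout of only the directories a team touches (a sketch; the URL, branch, and paths are placeholders, and the server has to support partial clone):

    #!/usr/bin/env python3
    # sparse_clone.py -- hedged sketch of a blobless, sparse clone of a big repo.
    import subprocess

    URL = "git@example.com:bigcorp/monorepo.git"  # placeholder

    # Skip file contents until needed and defer populating the working tree.
    subprocess.run(["git", "clone", "--filter=blob:none", "--no-checkout",
                    URL, "monorepo"], check=True)
    # Only materialize the directories this team actually works on.
    subprocess.run(["git", "-C", "monorepo", "sparse-checkout", "set",
                    "services/payments", "libs/common"], check=True)
    subprocess.run(["git", "-C", "monorepo", "checkout", "main"], check=True)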

I remember working with a repo that was >20GB, mostly from videos (we didn't know that initially). Pulling that repo took _forever_. Nobody on that team cared, because they almost never did a fresh pull and accounted for the time it took their CI/CD to do so in their reports. If it were a monorepo, MANY teams would've felt that pain more immediately.

Yes, monorepos require some tooling to prevent a gazillion artifacts from being deployed at once (and to specify what’s related to what if code lives across different folders). So do polyrepos! I’ve configured a few Jenkins jobs for my clients to dynamically pull different co-dependent Git repositories at build time. It’s a pain! Especially when multiple credentials are involved! Then there’s the whole “We have a gazillion repos and 20% of them are junk” problem, which requires automated reaping; also a more difficult problem than it seems.

Same with refactors. Refactors across polyrepos are just as much of a pain because you’re now subject to n build and review processes/pull requests, and seeing the entire diff is hard/impossible. This introduces mistakes. If anything, refactors in polyrepos are more of an event than they are for monorepos.

While monorepos have their problems, I will continue to advocate for them because the ability to see what’s going on in one place and for any developer to propose changes to any part of the code (theoretically) is massively beneficial, ESPECIALLY for complex business domains like healthcare or financial services. Plus, you will have a RelEng/BuildEng team when your codebase and engineering org gets large enough; why add more complexity by creating a gazillion repos that are possibly related to each other?

(A large engineering organization without a team focused on tools and builds doesn't exist. If yours doesn't have one, that means some or many developers are spending way more time spinning their wheels on build systems than they should be.)

The real reason why monorepos don’t happen in the aforementioned domains is because there’s no easy way to allow them and pass regulatory audits.

Many regulating bodies require hard boundaries enforced by role-based access control, especially for code that deals with personally-identifiable information or code between two or more domains that have a Chinese Wall between them. “All of my developers can check out the entire codebase” is an easy way to get fined hard, and polyrepos are much easier to restrict access into than folders in a monorepo are (one advantage not mentioned in the article). While you _can_ restrict access into directories within a single repo, doing so is not straightforward, and most organizations would rather not waste the engineering effort.

I would like to think that Google and Facebook have gotten away with it because they implemented a monorepo from the very beginning and the engineering involved in splitting it up is much more involved than engineering around it.

That said, I continue to advocate for them because discoverability is good and it builds a better engineering culture in the end. I would rather hit those walls and make just-in-time exceptions for them than assume that the walls are there and create a worse development experience without exploring better alternatives.


"Scalability" issues aren't encountered until your repo has many millions of LOCs and a lot of churn. For 99.99% of organizations this is not an issue and will never be an issue.


Overly opinionated garbage, imagine having to work with this guy.



