Interesting insights, and refreshing to see large-scale providers still relying on "old-school" tech!
What I did not fully grasp - isn't Vitess its own database system built on top of a K/V store, similar to e.g. CockroachDB? Here it sounds like they only use pieces of Vitess in front of regular MySQL primaries with sharding?
It's also curious that they sharded based on domains. I assume at some point the QPS on specific domains (e.g. Gist) will be so high that partitioning is no longer effective?
edit:// Looks like Vitess is indeed a set of tools on top of MySQL that allow for these types of scaling operations. It is not a re-implementation of MySQL using different underlying technologies.
> refreshing to see large-scale providers still relying on "old-school" tech!
Somewhat OT, but GitHub is really "old school" - they don't have an SPA frontend and render most things server-side the classic Rails way, with lots of UX enhancements coming from judicious use of JS and websockets for interactions. That's one of the things I find very endearing about their platform.
That approach unfortunately also makes keeping state updates consistent very difficult. Some pet peeves: when reviewing code, if you set up some file filters and then scroll down to trigger more files to load, the newly loaded files are not affected by the previously set filters. Or occasionally the page will helpfully offer to reload to show new commits, but clicking that link loses the current whitespace and file-filter settings.
SPA frameworks aren't a silver bullet here, and you can certainly fix these without using one, but these bugs are much harder to make when the DOM is rendered from a single source of truth instead of being patched piecemeal.
That's a good point. I've faced a couple of issues like that doing pure server-side Rails and mostly dealt with them by not having "infinite" scroll, though I guess that's not an option in all cases. I think Stimulus can go some way toward making experiences like that possible without a full SPA.
> In addition to vertical partitioning to move database tables, we also use horizontal partitioning (aka sharding). This allows us to split database tables across multiple clusters, enabling more sustainable growth. We’ll detail the tooling, linters, and Rails improvements related to this in a future blog post.
They make a clear distinction between "vertical partitioning" and "horizontal partitioning". I think there are probably bigger wins to be had, and bigger challenges, in the horizontal partitioning than in the vertical.
All of this work only got them a factor-of-two reduction in load, whereas with an ideal horizontal solution they could scale almost without limit.
I remember reading a post from Jeremy Zawodny back in the early 2000s that was basically: "Start virtually sharding your DB data so that when you have to physically shard it, it's a lot easier."
They also developed their own storage engine for MySQL (X-Engine), which is based on an LSM-tree data structure and uses FPGAs to accelerate level compaction [1]; whitepaper [2].
"In addition, FPGA hardware is used to accelerate the compaction process, further maximizing the performance of the system. This marks the first time that hardware acceleration was applied to the storage engine of an OLTP database."
Vitess can be thought of as a very smart MySQL proxy with additional MySQL management features built in.
Centralized vtgates proxy to vttablet services, which run on the same host as MySQL; the tablet then queries the local MySQL server. On top of this, a lot of magic can be built.
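For illustration: since vtgate speaks the MySQL wire protocol, a stock client library can talk to it as if it were a single MySQL server. A minimal sketch - host, port, credentials, and schema here are all made up (15306 is just the port Vitess examples tend to use):

    # Hedged sketch: connecting "through" Vitess with an ordinary MySQL
    # driver. All names and credentials below are invented.
    import pymysql

    conn = pymysql.connect(host="vtgate.internal", port=15306,
                           user="app", password="secret", database="commerce")
    try:
        with conn.cursor() as cur:
            # vtgate parses the query, picks the right shard(s), and forwards
            # it to the vttablet sitting next to the relevant MySQL instance.
            cur.execute("SELECT name FROM product WHERE sku = %s", ("sku-42",))
            print(cur.fetchall())
    finally:
        conn.close()

From the application's point of view, nothing changed versus talking to a plain MySQL server - that transparency is most of the appeal.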
Interesting approach, although I think SQL Server could do this kind of stuff natively (i.e. partition based on something easily indexable). Since Microsoft owns GitHub, they could give it a try and potentially not go broke on licensing fees like the rest of us would.
>Since Microsoft owns GitHub, they could give it a try and potentially not go broke on licensing fees like the rest of us would.
They are currently contributing to both Vitess and Rails. I am guessing they are also helping PlanetScale / are a paying customer of theirs.
Which means all the improvements and edge cases are battle-tested on GitHub and upstreamed. I'd much rather they continue on their current path, so others can enjoy the GitHub stack.
I know HN hates Oracle and MySQL. But generally speaking, I think Oracle has been doing a great job with Java and MySQL development.
I can't speak for HN, but I don't hate Oracle. The engineering is top notch. It's the "screw the customer" attitude of the sales dept - which no doubt emanates from the guy at the top - that annoys me no end.
Read it, and, well, sure, it is a challenge to work on an old codebase, and you need to jump through a few hoops that you don't need in pristine projects. Something tells me this guy wasn't able/willing to make the effort. (Could be a language problem - naming a "bug" a "bag", for example; u and a are not even close on the keyboard.)
> (Could be a language problem - naming a "bug" a "bag", for example; u and a are not even close on the keyboard.)
I think this is reading far too much into nothing. He writes 'bug' correctly about 5 times. I'm not an expert in psycholinguistics, but I suspect there's a phenomenon where you can mangle your internal pronunciation of the word and hit the wrong vowel.
Though I agree that old codebases are a particular challenge, picking on word confusion is unnecessary, even if minutiae are important in a legacy codebase. Dyslexia is a real thing. So are linters, thankfully.
Can you give more parameters for what something needs to qualify for your definition of "modern"?
The database harkens back to when computers were new. There's a ton of money that goes into its continued development, and it's vertically integrated, including custom hardware and the software to run it. It's extremely expensive hardware - they still sell SPARC servers, running Solaris if that's what you want (but they also support a custom Linux kernel for their hardware).
It's such a high-end niche that there's only a handful of companies that can even run the benchmark competitively, because it just costs so much in hardware to play at that level, which makes it very opaque unless you're fluent in a lot of terms, some of them proprietary, others not. E.g. https://blogs.oracle.com/exadata/post/exadata-uses-persisten... is an absolutely fascinating journey into getting better SQL performance that involves some really high-end shit, and (like Kubernetes) most of the people out there just don't play in the same league. Which isn't a judgement against them and their needs, but it costs a lot of resources to wring microseconds more performance from a multi-million-dollar machine. An AWS EC2 cloud VM, this ain't.
Sure, Shenandoah (started at Red Hat) is even more brutally mind-blowing, but it has a constant overhead (and - maybe obviously, maybe not - it builds on the already pretty good existing GC infrastructure in the JVM).
So at least it seems the HotSpot group is left mostly alone to do their high-quality work.
After I patched my third trivial bug in the Oracle InstantClient libraries when Oracle couldn't/wouldn't, that was the last straw for me. They have some impressive capabilities but I don't think it's because their engineering is great; I think it's because enough money can solve any problem.
What GitHub did was not table partitioning but schema separation. Splitting tables across separate disk partitions works well when the storage is local to the engine, and historically got around the size/IO limits of hard disks, but it's not necessarily suited to modern scaling problems.
What GitHub did instead was separate whole sub-systems out of the main db cluster into their own clusters, so that requests hit completely separate engines, the number of connections to each is reduced, and the chance of breaking something with a mistaken cross-schema query is reduced by physical separation.
I think you could still do this with SQL Server, but why bother if you already use MySQL and the tools exist there.
As far as I understand, the partitioning is handled at the application level, by preventing cross-domain joins:
> Building on top of schema domains, two new SQL linters enforce virtual boundaries between domains. They identify any violating queries and transactions that span schema domains by adding a query annotation and treating them as exemptions. If a domain has no violations, it is virtually partitioned and ready to be physically moved to another database cluster
Table-level partitioning doesn't help with this (AFAIU), as queries access multiple tables anyway (without app-level changes).
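For illustration, the core of such a linter boils down to something like this toy sketch (not GitHub's actual tool, which hooks into Rails and handles exemptions via annotations; a real linter would also parse SQL properly rather than regex it, and the table-to-domain mapping here is invented):

    import re

    # Hypothetical mapping of tables to schema domains.
    TABLE_DOMAINS = {
        "repositories": "repos",
        "pull_requests": "repos",
        "notifications": "notifications",
        "users": "core",
    }

    def referenced_tables(sql):
        # Naive extraction of identifiers after FROM/JOIN; demo-only.
        return set(re.findall(r"(?:from|join)\s+`?(\w+)`?", sql, re.IGNORECASE))

    def assert_single_domain(sql):
        domains = {TABLE_DOMAINS[t] for t in referenced_tables(sql)
                   if t in TABLE_DOMAINS}
        if len(domains) > 1:
            raise ValueError(f"query spans schema domains {sorted(domains)}: {sql}")

    try:
        assert_single_domain(
            "SELECT * FROM repositories JOIN users ON users.id = repositories.owner_id")
    except ValueError as e:
        print(e)  # query spans schema domains ['core', 'repos']: ...

Once no query trips the check for a given domain, every table in that domain can be moved to another cluster without any application query breaking.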
The standard db-level feature closest to what they're doing, if somebody "wants to try this at home", is probably tablespaces (or separate dbs, as the next step).
I see some inspiration from microservices (separation of models/storages), except that they're (I suppose) keeping the monolith approach.
You can kind of do the same thing with SQL Server table partitions, but it would largely rely on the original data model being very clean. Depending on how old that codebase is, and how experienced the developers were, that might not have been an option, I suppose.
Reading this, I wondered why in sharding posts I rarely see mentions of the default textbook solution to transactions across N systems, namely "two-phase commit". Upon searching, I found an informative Dropbox sharding blog post covering it at https://dropbox.tech/infrastructure/cross-shard-transactions... .
Because most of these systems in the real world are simply designed, from experience, not to use cross-shard transactions, so 2PC isn't relevant. There's nothing to "mention." It's a cultural choice as much as a technical one, really. Now, "well-designed applications do not use 2PC at scale" is a baseline assumption for several reasons.
Vitess, for example, basically follows this principle. 2PC/cross-shard in Vitess can be done, and you could make it faster, but essentially all users and the developers take the view that applications designed to scale with Vitess should instead avoid 2PC at the design stage. Why do they take this view? Because their real-world experience is that sharding scales forever and cross-shard transactions don't, no matter the effort. It might scale far, but never "enough." This kind of experience is what drives such a design philosophy.
There's of course a feedback loop to all this -- nobody will use 2PC if it's slow, so it's avoided, and nobody will patch it to be fast, because you can just avoid it instead of spending time on that, etc. Also, designing for such a system up front is maybe more difficult, or less familiar. But I think it should be noted historically and contextually that this whole push for what is effectively strongly consistent databases on the WAN, where strong transactions won't obliterate performance, is a relatively recent trend. You can build something like this bespoke if you're careful, but in terms of COTS tools, well, that's been very limited until recently. So the institutional knowledge that engineers carry around is all built on a different set of assumptions, like "don't use 2PC."
My own personal experience talking to people "in the wild" (whatever that means) is that the Vitess philosophy is relatively popular at a lot of high-scale places, and probably isn't going away anytime soon. It helps that you can, you know, just use tools like Vitess. But the push for strongly consistent WAN databases (Yugabyte, CockroachDB, etc) isn't slowing down either, so...
Yep, the feedback loop is what I'm after in a way I guess. My takeaway from the Dropbox blog post is "it was found too slow unless the vast majority of your transactions are single-shard". If it was excluded from design options for that reason, it's relevant.
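For anyone who hasn't seen it spelled out, here's a minimal toy sketch of the textbook 2PC flow under discussion. It's in-memory only; the durable prepare logs, timeouts, and blocking coordinator round-trips a real implementation needs are exactly where the cost comes from:

    class Participant:
        def __init__(self, name):
            self.name = name
            self.staged = None

        def prepare(self, writes):
            # Phase 1: stage the writes and vote. A real shard would write
            # a durable prepare record before voting yes.
            self.staged = writes
            return True

        def commit(self):
            # Phase 2: make the staged writes visible.
            print(f"{self.name} committed {self.staged}")

        def abort(self):
            self.staged = None

    def two_phase_commit(participants, writes_by_shard):
        # Phase 1: every shard must vote yes, or everything rolls back.
        if all(p.prepare(writes_by_shard[p.name]) for p in participants):
            for p in participants:  # Phase 2: commit everywhere.
                p.commit()
            return True
        for p in participants:
            p.abort()
        return False

    shards = [Participant("shard_a"), Participant("shard_b")]
    two_phase_commit(shards, {"shard_a": {"x": 1}, "shard_b": {"y": 2}})

Every transaction pays at least two sequential round-trips to every involved shard, and a coordinator crash between the phases leaves participants blocked, which is why single-shard designs dodge the whole problem.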
They use the regular filesystem (via a library version of git: libgit2). It seems crazy that they use the standard off-the-shelf option, but they've contributed a massive amount of work to improving it.
Skimming search results for that, I think GitHub does something similar. It seems Gitaly is an RPC layer between the frontend and the git C code running on file servers, no?
On one hand there are a ton of distributed RDBMSes like Cockroach, Yugabyte, etc., and yet it looks like a good majority of large projects are opting for Vitess. Is coordination overhead so heavy in those systems that they are not viable for these use cases?
When people choose Vitess over Cockroach, TiDB, etc., it's usually not because they have any reason to think Cockroach wouldn't work. Vitess is just a far simpler, safer, more predictable migration if you're already using MySQL and have a ton of legacy code and operational tooling.
For a brand new project without legacy code, newer generation databases become an easier sell.
Normally a head-to-head comparison isn't properly done unless a company has already adopted Vitess and is looking to switch from that to a NewSQL database. Deploying a new database at scale is a lot of work. It's often the case that organizations pick one that has social proof and otherwise makes sense to them, and go with that choice if they can make it meet their requirements. Otherwise they might have to triple their workload by comparing three different candidates while their DB is under pressure. And the comparison would likely show that one option is superior in certain ways, but that another option is better for other workloads, etc.
GitHub had already spent effort developing and operating some MySQL coordination tools, which they are probably still able to keep using in their Vitess setup. Generally, Vitess has the appeal that it seems similar to what you are used to and is operationally easier to understand than a NewSQL system. With a NewSQL system the work of this entire blog post is unnecessary (you can still do joins, and high-availability fail-over is built in), but you have to learn to trust and operate a system that is initially much less familiar. Disclaimer: I work on TiDB.
I would imagine the scaling characteristics might also play some role here? For workloads that can be easily partitioned manually, the efficiency might be a consideration?
In theory, scaling the NewSQL system means just adding a new node. In practice, the manually partitioned system may perform better, but that's actually due to trade-offs that lower availability. In either case you have some metadata helping to direct queries to the right node. Of course, banning certain joins can help a db scale.
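For illustration, that routing metadata in both cases reduces to "shard key in, cluster out". A naive hash-based sketch - real systems (Vitess vindexes, TiDB's placement metadata) are far more sophisticated, and the names here are invented:

    import hashlib

    CLUSTERS = ["cluster-0", "cluster-1", "cluster-2"]

    def cluster_for(shard_key):
        # Deterministic routing: no per-query coordination needed.
        digest = hashlib.sha256(shard_key.encode()).digest()
        return CLUSTERS[int.from_bytes(digest[:8], "big") % len(CLUSTERS)]

    print(cluster_for("repo:314159"))

Note that plain modulo hashing like this makes adding a cluster expensive (most keys move), which is one reason real systems use consistent hashing or range metadata instead.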
I somewhat miss being on smaller teams. I realized recently that my code has been surrounded by backend services for so long that I should probably take SQL off of my list of skills. Even now when I'm 'on the backend' I'm not talking directly to any relational database, just REST calls or KV gets.
The area I really want to chew on is partial indexes, now that it's no longer a niche feature. I'd like to have a more informed opinion of how far you can scale a system without resorting to sharding, just by use of more efficient indexes. Especially on read-dominated problem domains.
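For anyone curious, here's a quick runnable illustration of the idea using SQLite, which supports partial indexes much like Postgres does (the schema is invented):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE issues (id INTEGER PRIMARY KEY, state TEXT, title TEXT)")
    # Index only the rows the hot read path cares about: the index stays
    # small even if the vast majority of issues are closed.
    db.execute("CREATE INDEX idx_open ON issues(title) WHERE state = 'open'")

    plan = db.execute(
        "EXPLAIN QUERY PLAN SELECT title FROM issues WHERE state = 'open'"
    ).fetchall()
    print(plan)  # should show a scan of idx_open rather than the full table

The win on read-dominated workloads is that the index covers exactly the hot subset, so it stays in cache and stays cheap to maintain.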
"In 2019, mysql1 answered 950,000 queries/s on average, 900,000 queries/s on replicas, and 50,000 queries/s on the primary."
Glad folks from GitHub shared this story. As a database practitioner who managed a couple of the world's largest database fleets in the past and saw the scale a single db can support, I am 100% convinced that only a very, very tiny fraction of applications need sharding.
Using "boring" technology to build an innovative product, not the other way around.
You need a better information architecture to handle read replicas than just running a monolithic database. You need a much more complicated/limited one to do sharding.
I wonder if people see it as ripping the bandaid off to go straight to sharding. Or perhaps being multitenant introduces that idea. Seems like everyone, or at least the louder ones, discovers that they have one 'whale' who gets too big even for a single database, or buys one of your other customers, and then where are you?
Cool - we have a similar setup with dedicated databases for logically separated parts of the app. I work at an online classifieds platform built on Rails. For example, we have a listing-events database that acts as an event log for our listings. It's pretty huge, so having it on a separate db is great, mainly for RDS storage costs.
Interesting read. I wouldn't have thought such an approach would work for the systems I maintain, because my data is not as partitionable as theirs. But when addressing access across domains, they mention:
> Joining data in the application instead of in the database is another common solution. [...] In some cases, this leads to surprising performance improvements.
So maybe such an approach is not impossible for a more heavily connected dataset, although it's probably prohibitively expensive if you have to rewrite most queries into separate queries plus an in-app join.
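To make that concrete, a minimal sketch of what "join in the application" looks like: two single-domain queries, then a hash join in memory. The rows below are hard-coded stand-ins for two cursor results, with made-up tables:

    # Pretend result of: SELECT id, owner_id, name FROM repositories (cluster A)
    repos = [
        {"id": 1, "owner_id": 7, "name": "linguist"},
        {"id": 2, "owner_id": 8, "name": "gitignore"},
    ]
    # Pretend result of: SELECT id, login FROM users (cluster B)
    users = [
        {"id": 7, "login": "alice"},
        {"id": 8, "login": "bob"},
    ]

    # Build the hash side once, then probe it per row: O(repos + users),
    # instead of a JOIN that would have to span two clusters.
    owners = {u["id"]: u["login"] for u in users}
    joined = [{"repo": r["name"], "owner": owners.get(r["owner_id"])} for r in repos]
    print(joined)

The surprising speedups the post mentions presumably come from each side being a simple indexed lookup on its own cluster, rather than the database materializing a big join.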
As described in the blog post, large parts of our database schema have grown in an organic fashion. These virtual partitions allow us to prepare our database access for splitting out groups of tables into separate clusters in the medium term, with sharding those clusters being the next step and a long-term solution to handle our database growth.
We could enable the linter in production to silently log problematic queries without actually affecting their execution.
If we used separate db users as you're suggesting, any query that we didn't catch beforehand (e.g. via our CI builds) would cause noticeable problems for our users, which is something that we want to avoid.
Additionally, switching to a separate user account would require holding open twice the number of connections to each database server (old db user plus new db user), which probably would be fine but is still a lot of additional connections at our scale.
Would it double the number of connections? If you are doing the same volume of total work and not looking to increase concurrency, wouldn't you end up with something closer to two connection pools with ~1/2 the size compared to one connection pool with your old size?
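Back-of-envelope, with invented numbers, on where the two views diverge:

    processes = 100
    old_pool = 10                 # before: 100 * 10 = 1000 connections

    # If traffic splits 60/40 and per-process pools shrink proportionally,
    # the total stays flat:
    proportional = processes * (6 + 4)           # 1000
    # But pools usually keep per-process floors for burst headroom, so each
    # side may stay near the old size - which is where "twice" comes from:
    floored = processes * (old_pool + old_pool)  # 2000

    print(proportional, floored)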
I think his suggestion was to use service-specific users with limited grants only in the development environment, to catch queries that access tables of other services. The production environment would keep using users with grants on all services' tables.
I see a lot of value in using separate users in both dev/test and production environments. That gives you an "easy" way to physically separate your database/schema into multiple databases, with minimal changes to your application other than pointing your connection pools/config to the new database endpoint. We do this often, separating the task of breaking up our monolith databases into two phases - logical, then physical.
It's almost like applying DDD bounded contexts after the fact: breaking your apps up along their boundaries and moving them out at the storage level. I do have some questions, though:
1. I wonder what happens at the edge of the boundaries, when a table does need data from another domain. And what if that domain/cluster is down?
2. How do they physically connect to the cluster? A separate db connection?
1. Migrating is a huge, difficult, expensive project at GitHub's scale -- almost as bad as a rewrite
2. "Can handle more data and load" doesn't make sense as a blanket statement
How much data and load Postgres can handle vs MySQL is determined by lots of variables. It's neither absolute nor consistent. It depends more on database and infrastructure design than on the underlying RDBMS.
The best reason to choose Postgres over MySQL is not hardware efficiency but rather developer efficiency, although some projects get both.
Wouldn't it actually get easier to change the tires on a vehicle with a lot of them? You can't change the tire on a unicycle while moving, but it would theoretically already be possible on a motorcycle if you can wheelie long enough (the back wheel is more difficult), and each wheel beyond that just adds more stability and redundancy.
Uber once chose to switch from Postgres to MySQL, for performance reasons.
There was a lot of debate and discussion on HN[0] about it. I think there is enough complexity in either system that you can't make the generalization that a given system "can handle more data and load".