Interesting insights, and refreshing to see large-scale providers still relying on "old-school" tech!
What I did not fully grasp - isn't Vitess its own database system built on top of a K/V store, similar to e.g. CockroachDB? Here it sounds like they only use pieces of Vitess in front of regular MySQL primaries with sharding?
It's also curious that they sharded based on domains. I assume at some point the QPS on specific domains (e.g. Gist) will be so high that partitioning is no longer effective?
edit:// Looks like Vitess is indeed a set of tools on top of MySQL that allow for these types of scaling operations. It is not a re-implementation of MySQL using different underlying technologies.
> refreshing to see large-scale providers still relying on "old-school" tech!
Somewhat OT, but GitHub is really "old school" - they don't have an SPA frontend and render most things server-side the classic Rails way, with lots of UX enhancements coming from judicious use of JS and websockets for interactions. That's one of the things I find very endearing about their platform.
That approach unfortunately also makes keeping state updates consistent very difficult. Some pet peeves: when reviewing code, if you set up some file filters and then scroll down to trigger more files to load, the newly loaded files are not affected by the previously set filters. Or occasionally the page will helpfully offer to reload to show new commits, but clicking that link loses the current whitespace and file-filter settings.
SPA frameworks aren't a silver bullet here, and you can certainly fix these without using one, but these bugs are much harder to make when the DOM is rendered from a single source of truth instead of being patched piecemeal.
That's a good point. I've faced a couple of issues like that doing pure server-side Rails and mostly dealt with them by not having "infinite" scroll, though I guess that's not an option in all cases. I think Stimulus can go some way toward making experiences like that possible without a full SPA.
> In addition to vertical partitioning to move database tables, we also use horizontal partitioning (aka sharding). This allows us to split database tables across multiple clusters, enabling more sustainable growth. We’ll detail the tooling, linters, and Rails improvements related to this in a future blog post.
They make a clear distinction between "vertical partitioning" and "horizontal partitioning". I think there are probably bigger wins to be had, and bigger challenges, in the horizontal partitioning than in the vertical.
All of this work only got them a factor-of-two reduction in load, whereas with an ideal horizontal solution they could scale almost without limit.
I remember reading a post from Jeremy Zawodny back in the early 2000s that was basically: "Start virtually sharding your DB data so that when you have to physically shard it, it's a lot easier."
They also developed their own storage engine for MySQL (X-Engine), which is based on an LSM-tree data structure and uses FPGAs to accelerate level compaction [1]; whitepaper [2].
"In addition, FPGA hardware is used to accelerate the compaction process, further maximizing the performance of the system. This marks the first time that hardware acceleration was applied to the storage engine of an OLTP database."
Vitess can be thought of as a very smart MySQL proxy with additional MySQL management features built in.
Centralized vtgates proxy to vttablet services, which run on the same host as MySQL; the tablet then queries the local MySQL server. On top of this, a lot of magic can be built.
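For illustration: since vtgate speaks the MySQL wire protocol, a stock client library can talk to it as if it were a single MySQL server. A minimal sketch - host, port, credentials, and schema here are all made up (15306 is just the port Vitess examples tend to use):

    # Hedged sketch: connecting "through" Vitess with an ordinary MySQL
    # driver. All names and credentials below are invented.
    import pymysql

    conn = pymysql.connect(host="vtgate.internal", port=15306,
                           user="app", password="secret", database="commerce")
    try:
        with conn.cursor() as cur:
            # vtgate parses the query, picks the right shard(s), and forwards
            # it to the vttablet sitting next to the relevant MySQL instance.
            cur.execute("SELECT name FROM product WHERE sku = %s", ("sku-42",))
            print(cur.fetchall())
    finally:
        conn.close()

From the application's point of view, nothing changed versus talking to a plain MySQL server - that transparency is most of the appeal.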
Interesting approach, although I think SQL Server could do this kind of stuff natively (i.e. partition based on something easily indexable). Since Microsoft owns GitHub, they could give it a try and potentially not go broke on licensing fees like the rest of us would.
>Since Microsoft owns GitHub, they could give it a try and potentially not go broke on licensing fees like the rest of us would.
They are currently contributing to both Vitess and Rails. I am guessing they are also helping PlanetScale / are a paying customer of theirs.
Which means all the improvements and edge cases are battle-tested on GitHub and upstreamed. I'd much rather they continue on their current path, so others can enjoy the GitHub stack.
I know HN hates Oracle and MySQL. But generally speaking, I think Oracle has been doing a great job with Java and MySQL development.
I can't speak for HN, but I don't hate Oracle. The engineering is top notch. It's the "screw the customer" attitude of the sales dept - which no doubt emanates from the guy at the top - that annoys me no end.
Read it, and, well, sure, it is a challenge to work on an old codebase, and you need to jump through a few hoops that you don't need in pristine projects. Something tells me this guy wasn't able/willing to make the effort. (Could be a language problem - naming a "bug" a "bag", for example; u and a are not even close on the keyboard.)
> (Could be a language problem - naming a "bug" a "bag", for example; u and a are not even close on the keyboard.)
I think this is reading far too much into nothing. He writes 'bug' correctly about 5 times. I'm not an expert in psycholinguistics, but I suspect there's a phenomenon where you can mangle your internal pronunciation of the word and hit the wrong vowel.
Though I agree that old codebases are a particular challenge, picking on word confusion is unnecessary, even if minutiae are important in a legacy codebase. Dyslexia is a real thing. So are linters, thankfully.
Can you give more parameters for what something needs to qualify for your definition of "modern"?
The database harkens back to when computers were new. There's a ton of money that goes into its continued development, and it's vertically integrated, including custom hardware and the software to run it. It's extremely expensive hardware - they still sell SPARC servers, running Solaris if that's what you want (but they also support a custom Linux kernel for their hardware).
It's such a high-end niche that there's only a handful of companies that can even run the benchmark competitively, because it just costs so much in hardware to play at that level, which makes it very opaque unless you're fluent in a lot of terms, some of them proprietary, others not. E.g. https://blogs.oracle.com/exadata/post/exadata-uses-persisten... is an absolutely fascinating journey into getting better SQL performance that involves some really high-end shit, and (like Kubernetes) most of the people out there just don't play in the same league. Which isn't a judgement against them and their needs, but it costs a lot of resources to wring microseconds more performance from a multi-million-dollar machine. An AWS EC2 cloud VM, this ain't.
Sure, Shenandoah (started at Red Hat) is even more brutally mind-blowing, but it has a constant overhead (and - maybe obviously, maybe not - it builds on the already pretty good existing GC infrastructure in the JVM).
So at least it seems the HotSpot group is left mostly alone to do their high-quality work.
After I patched my third trivial bug in the Oracle InstantClient libraries when Oracle couldn't/wouldn't, that was the last straw for me. They have some impressive capabilities but I don't think it's because their engineering is great; I think it's because enough money can solve any problem.
What GitHub did was not table partitioning but schema separation. Splitting tables across separate disk partitions works well when the storage is local to the engine, and historically got around the size/IO limits of hard disks, but it's not necessarily suited to modern scaling problems.
What GitHub did instead was separate whole sub-systems out of the main db cluster into their own clusters, so that requests hit completely separate engines, the number of connections to each is reduced, and the chance of breaking something with a mistaken cross-schema query is reduced by physical separation.
I think you could still do this with SQL Server, but why bother if you already use MySQL and the tools exist there.
As far as I understand, the partitioning is handled at the application level, by preventing cross-domain joins:
> Building on top of schema domains, two new SQL linters enforce virtual boundaries between domains. They identify any violating queries and transactions that span schema domains by adding a query annotation and treating them as exemptions. If a domain has no violations, it is virtually partitioned and ready to be physically moved to another database cluster
Table-level partitioning doesn't help with this (AFAIU), as queries access multiple tables anyway (without app-level changes).
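For illustration, the core of such a linter boils down to something like this toy sketch (not GitHub's actual tool, which hooks into Rails and handles exemptions via annotations; a real linter would also parse SQL properly rather than regex it, and the table-to-domain mapping here is invented):

    import re

    # Hypothetical mapping of tables to schema domains.
    TABLE_DOMAINS = {
        "repositories": "repos",
        "pull_requests": "repos",
        "notifications": "notifications",
        "users": "core",
    }

    def referenced_tables(sql):
        # Naive extraction of identifiers after FROM/JOIN; demo-only.
        return set(re.findall(r"(?:from|join)\s+`?(\w+)`?", sql, re.IGNORECASE))

    def assert_single_domain(sql):
        domains = {TABLE_DOMAINS[t] for t in referenced_tables(sql)
                   if t in TABLE_DOMAINS}
        if len(domains) > 1:
            raise ValueError(f"query spans schema domains {sorted(domains)}: {sql}")

    try:
        assert_single_domain(
            "SELECT * FROM repositories JOIN users ON users.id = repositories.owner_id")
    except ValueError as e:
        print(e)  # query spans schema domains ['core', 'repos']: ...

Once no query trips the check for a given domain, every table in that domain can be moved to another cluster without any application query breaking.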
The standard db-level feature closest to what they're doing, if somebody "wants to try this at home", is probably tablespaces (or separate dbs, as the next step).
I see some inspiration from microservices (separation of models/storages), except that they're (I suppose) keeping the monolith approach.
You can kind of do the same thing with SQL Server table partitions, but it would largely rely on the original data model being very clean. Depending on how old that codebase is, and how experienced the developers were, that might not have been an option, I suppose.
Reading this, I wondered why in sharding posts I rarely see mentions of the default textbook solution to transactions across N systems, namely "two-phase commit". Upon searching, I found an informative Dropbox sharding blog post covering it at https://dropbox.tech/infrastructure/cross-shard-transactions... .
Because most of these systems in the real world are simply designed, from experience, not to use cross-shard transactions, so 2PC isn't relevant. There's nothing to "mention." It's a cultural choice as much as a technical one, really. Now, "well-designed applications do not use 2PC at scale" is a baseline assumption for several reasons.
Vitess, for example, basically follows this principle. 2PC/cross-shard in Vitess can be done, and you could make it faster, but essentially all users and the developers take the view that applications designed to scale with Vitess should instead avoid 2PC at the design stage. Why do they take this view? Because their real-world experience is that sharding scales forever and cross-shard transactions don't, no matter the effort. It might scale far, but never "enough." This kind of experience is what drives such a design philosophy.
There's of course a feedback loop to all this -- nobody will use 2PC if it's slow, so it's avoided, and nobody will patch it to be fast, because you can just avoid it instead of spending time on that, etc. Also, designing for such a system up front is maybe more difficult, or less familiar. But I think it should be noted historically and contextually that this whole push for what is effectively strongly consistent databases on the WAN, where strong transactions won't obliterate performance, is a relatively recent trend. You can build something like this bespoke if you're careful, but in terms of COTS tools, well, that's been very limited until recently. So the institutional knowledge that engineers carry around is all built on a different set of assumptions, like "don't use 2PC."
My own personal experience talking to people "in the wild" (whatever that means) is that the Vitess philosophy is relatively popular at a lot of high-scale places, and probably isn't going away anytime soon. It helps that you can, you know, just use tools like Vitess. But the push for strongly consistent WAN databases (Yugabyte, CockroachDB, etc) isn't slowing down either, so...
Yep, the feedback loop is what I'm after in a way I guess. My takeaway from the Dropbox blog post is "it was found too slow unless the vast majority of your transactions are single-shard". If it was excluded from design options for that reason, it's relevant.
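For anyone who hasn't seen it spelled out, here's a minimal toy sketch of the textbook 2PC flow under discussion. It's in-memory only; the durable prepare logs, timeouts, and blocking coordinator round-trips a real implementation needs are exactly where the cost comes from:

    class Participant:
        def __init__(self, name):
            self.name = name
            self.staged = None

        def prepare(self, writes):
            # Phase 1: stage the writes and vote. A real shard would write
            # a durable prepare record before voting yes.
            self.staged = writes
            return True

        def commit(self):
            # Phase 2: make the staged writes visible.
            print(f"{self.name} committed {self.staged}")

        def abort(self):
            self.staged = None

    def two_phase_commit(participants, writes_by_shard):
        # Phase 1: every shard must vote yes, or everything rolls back.
        if all(p.prepare(writes_by_shard[p.name]) for p in participants):
            for p in participants:  # Phase 2: commit everywhere.
                p.commit()
            return True
        for p in participants:
            p.abort()
        return False

    shards = [Participant("shard_a"), Participant("shard_b")]
    two_phase_commit(shards, {"shard_a": {"x": 1}, "shard_b": {"y": 2}})

Every transaction pays at least two sequential round-trips to every involved shard, and a coordinator crash between the phases leaves participants blocked, which is why single-shard designs dodge the whole problem.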
They use the regular filesystem (via a library version of git: libgit2). It seems crazy that they use the standard off-the-shelf option, but they've contributed a massive amount of work to improving it.
Skimming search results for that, I think GitHub does something similar. It seems Gitaly is an RPC layer between the frontend and the git C code running on file servers, no?
On one hand there are a ton of distributed RDBMSes like Cockroach, Yugabyte, etc., and yet it looks like a good majority of large projects are opting for Vitess. Is coordination overhead so heavy in those systems that they are not viable for these use cases?
When people choose Vitess over Cockroach, TiDB, etc., it's usually not because they have any reason to think Cockroach wouldn't work. Vitess is just a far simpler, safer, more predictable migration if you're already using MySQL and have a ton of legacy code and operational tooling.
For a brand new project without legacy code, newer generation databases become an easier sell.
Normally a head-to-head comparison isn't properly done unless a company has already adopted Vitess and is looking to switch from that to a NewSQL database. Deploying a new database at scale is a lot of work. It's often the case that organizations pick one that has social proof and otherwise makes sense to them, and go with that choice if they can make it meet their requirements. Otherwise they might have to triple their workload by comparing three different candidates while their DB is under pressure. And the comparison would likely show that one option is superior in certain ways, but that another option is better for other workloads, etc.
GitHub had already spent effort developing and operating some MySQL coordination tools, which they are probably still able to keep using in their Vitess setup. Generally, Vitess has the appeal that it seems similar to what you are used to and is operationally easier to understand than a NewSQL system. With a NewSQL system the work of this entire blog post is unnecessary (you can still do joins, and high-availability fail-over is built in), but you have to learn to trust and operate a system that is initially much less familiar. Disclaimer: I work on TiDB.
I would imagine the scaling characteristics might also play some role here? For workloads that can be easily partitioned manually, the efficiency might be a consideration?
In theory, scaling the NewSQL system means just adding a new node. In practice, the manually partitioned system may perform better, but that's actually due to trade-offs that lower availability. In either case you have some metadata helping to direct queries to the right node. Of course, banning certain joins can help a db scale.
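For illustration, that routing metadata in both cases reduces to "shard key in, cluster out". A naive hash-based sketch - real systems (Vitess vindexes, TiDB's placement metadata) are far more sophisticated, and the names here are invented:

    import hashlib

    CLUSTERS = ["cluster-0", "cluster-1", "cluster-2"]

    def cluster_for(shard_key):
        # Deterministic routing: no per-query coordination needed.
        digest = hashlib.sha256(shard_key.encode()).digest()
        return CLUSTERS[int.from_bytes(digest[:8], "big") % len(CLUSTERS)]

    print(cluster_for("repo:314159"))

Note that plain modulo hashing like this makes adding a cluster expensive (most keys move), which is one reason real systems use consistent hashing or range metadata instead.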
I somewhat miss being on smaller teams. I realized recently that my code has been surrounded by backend services for so long that I should probably take SQL off of my list of skills. Even now when I'm 'on the backend' I'm not talking directly to any relational database, just REST calls or KV gets.
The area I really want to chew on is partial indexes, now that it's no longer a niche feature. I'd like to have a more informed opinion of how far you can scale a system without resorting to sharding, just by use of more efficient indexes. Especially on read-dominated problem domains.
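For anyone curious, here's a quick runnable illustration of the idea using SQLite, which supports partial indexes much like Postgres does (the schema is invented):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE issues (id INTEGER PRIMARY KEY, state TEXT, title TEXT)")
    # Index only the rows the hot read path cares about: the index stays
    # small even if the vast majority of issues are closed.
    db.execute("CREATE INDEX idx_open ON issues(title) WHERE state = 'open'")

    plan = db.execute(
        "EXPLAIN QUERY PLAN SELECT title FROM issues WHERE state = 'open'"
    ).fetchall()
    print(plan)  # should show a scan of idx_open rather than the full table

The win on read-dominated workloads is that the index covers exactly the hot subset, so it stays in cache and stays cheap to maintain.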
"In 2019, mysql1 answered 950,000 queries/s on average, 900,000 queries/s on replicas, and 50,000 queries/s on the primary."
Glad folks from GitHub shared this story. As a database practitioner who managed a couple of the world's largest database fleets in the past and saw the scale a single db can support, I am 100% convinced that only a very, very tiny fraction of applications need sharding.
Using "boring" technology to build an innovative product, not the other way around.
You need a better information architecture to handle read replicas than just running a monolithic database. You need a much more complicated/limited one to do sharding.
I wonder if people see it as ripping the bandaid off to go straight to sharding. Or perhaps being multitenant introduces that idea. Seems like everyone, or at least the louder ones, discovers that they have one 'whale' who gets too big even for a single database, or buys one of your other customers, and then where are you?
Cool - we have a similar setup with dedicated databases for logically separated parts of the app. I work at an online classifieds platform built on Rails. For example, we have a listing-events database that acts as an event log for our listings. It's pretty huge, so having it on a separate db is great, mainly for RDS storage costs.
Interesting read. I wouldn't have thought such an approach would work for the systems I maintain, because my data is not as partitionable as theirs. But when addressing access across domains, they mention:
> Joining data in the application instead of in the database is another common solution. [...] In some cases, this leads to surprising performance improvements.
So maybe such an approach is not impossible for a more heavily connected dataset, although it's probably prohibitively expensive if you have to rewrite most queries into separate queries plus an in-app join.
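To make that concrete, a minimal sketch of what "join in the application" looks like: two single-domain queries, then a hash join in memory. The rows below are hard-coded stand-ins for two cursor results, with made-up tables:

    # Pretend result of: SELECT id, owner_id, name FROM repositories (cluster A)
    repos = [
        {"id": 1, "owner_id": 7, "name": "linguist"},
        {"id": 2, "owner_id": 8, "name": "gitignore"},
    ]
    # Pretend result of: SELECT id, login FROM users (cluster B)
    users = [
        {"id": 7, "login": "alice"},
        {"id": 8, "login": "bob"},
    ]

    # Build the hash side once, then probe it per row: O(repos + users),
    # instead of a JOIN that would have to span two clusters.
    owners = {u["id"]: u["login"] for u in users}
    joined = [{"repo": r["name"], "owner": owners.get(r["owner_id"])} for r in repos]
    print(joined)

The surprising speedups the post mentions presumably come from each side being a simple indexed lookup on its own cluster, rather than the database materializing a big join.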
As described in the blog post, large parts of our database schema have grown in an organic fashion. These virtual partitions allow us to prepare our database access for splitting out groups of tables into separate clusters in the medium term, with sharding those clusters being the next step and a long-term solution to handle our database growth.
We could enable the linter in production to silently log problematic queries without actually affecting their execution.
If we used separate db users as you're suggesting, any query that we didn't catch beforehand (e.g. via our CI builds) would cause noticeable problems for our users, which is something that we want to avoid.
Additionally, switching to a separate user account would require holding open twice the number of connections to each database server (old db user plus new db user), which probably would be fine but is still a lot of additional connections at our scale.
Would it double the number of connections? If you are doing the same volume of total work and not looking to increase concurrency, wouldn't you end up with something closer to two connection pools with ~1/2 the size compared to one connection pool with your old size?
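Back-of-envelope, with invented numbers, on where the two views diverge:

    processes = 100
    old_pool = 10                 # before: 100 * 10 = 1000 connections

    # If traffic splits 60/40 and per-process pools shrink proportionally,
    # the total stays flat:
    proportional = processes * (6 + 4)           # 1000
    # But pools usually keep per-process floors for burst headroom, so each
    # side may stay near the old size - which is where "twice" comes from:
    floored = processes * (old_pool + old_pool)  # 2000

    print(proportional, floored)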
I think his suggestion was to use service-specific users with limited grants only in the development environment, to catch queries that access tables of other services. The production environment would keep using users with grants on all services' tables.
I see a lot of value in using separate users in both dev/test and production environments. That gives you an "easy" way to physically separate your database/schema into multiple databases, with minimal changes to your application other than pointing your connection pools/config to the new database endpoint. We do this often, separating the task of breaking up our monolith databases into two phases - logical, then physical.
It's almost like applying DDD bounded contexts after the fact: breaking your apps up along their boundaries and moving them out at the storage level. I do have some questions, though:
1. I wonder what happens at the edge of the boundaries, when a table does need data from another domain. And what if that domain/cluster is down?
2. How do they physically connect to the cluster? A separate db connection?
1. Migrating is a huge, difficult, expensive project at GitHub's scale -- almost as bad as a rewrite
2. "Can handle more data and load" doesn't make sense as a blanket statement
How much data and load Postgres can handle vs MySQL is determined by lots of variables. It's neither absolute nor consistent. It depends more on database and infrastructure design than on the underlying RDBMS.
The best reason to choose Postgres over MySQL is not hardware efficiency but rather developer efficiency, although some projects get both.
Wouldn't it actually get easier to change the tires on a vehicle with a lot of them? You can't change the tire on a unicycle while moving, but it would theoretically already be possible on a motorcycle if you can wheelie long enough (the back wheel is more difficult), and each wheel beyond that just adds more stability and redundancy.
Uber once chose to switch from Postgres to MySQL, for performance reasons.
There was a lot of debate and discussion on HN[0] about it. I think there is enough complexity in either system that you can't make the generalization that a given system "can handle more data and load".