The GitHub Availability Report (github.blog)
180 points by mountainview on July 16, 2020 | 98 comments



> We strive to engineer systems that are highly available and fault-tolerant and we expect that most of these monthly updates will recap periods of time where GitHub was >99% available.

So that's them striving for two nines most of the time, which seems to be a fancy way of saying one nine. Be careful you're not setting the bar too high!


That doesn’t seem to be what it’s saying. Wording is unclear, but it seems to be saying that _for the incidents recounted in these updates_, most of the time the GitHub service itself will be 99% available. That’s not the same as saying they only strive to make GitHub 99% available at all times.


Hm. Not sure I see that. The updates are monthly, so the period of time they're recapping will be the last month. And they're expecting that on "most" recapped periods the availability was >99%. I.e. they also expect to have months where availability is <99%.


a key distinction! I favor your interpretation


That number stands out more than anything else in that report.

99% availability effectively becomes the weakest link in the chain. Would they still have pushed this report out if they had phrased it as "...where GitHub was only down for an hour for every 99 hours of uptime"?


Also, this is like the South African Post Office: 99.9% of packages delivered! = you lost 0.1% of them, which totalled millions.


I always heard that the number of nines was after the point, so that'd be zero nines. Two nines would be 99.99%


That's wrong. Five nines, for example, is 99.999% uptime, or a bit over five minutes of downtime per year.


Maybe you're thinking of the actual proportion value: 99% can more simply be written as 0.99, which is two nines after the true decimal point.


Three out of four times MySQL was involved.


If you are running an app that is basically a UI layer over a database, where would you expect significant failures to occur? The users are already debugging possible problems with git locally by simply using it, so UI-database interactions are where all the action happens on products like github.


When I was CTO at a few companies we had very different problems, with DBs being only one of them (and thankfully no four-hour-long outages in two months).


I might be wrong, but you probably didn't handle as much traffic as GitHub does. The probability of bugs and outages is roughly proportional to traffic. At least intuitively.


No, we didn't, luckily, but we also didn't have that much money. Our biggest problem was 50k+ people trying to reserve a limited amount of stuff at the same time and 5k+ logins/second (at first we made the mistake of writing last_login to MySQL ;-))


We changed to a write-behind setup (e.g. with Redis Pub/Sub or any other queue). I think everyone with some experience will use async write-behind rather than synchronous writes during the request, which is what we did at first.


How did you go about saving login/device data? We are going to do this soon in our app and I'm looking for good use cases and solutions


What confuses me is that CPU starvation caused crashes. Sure, too few CPU resources should cause a slowdown, but the database process itself shouldn't crash as a result. I haven't dealt with MySQL that much, but at least with Postgres I've never seen 100% load cause a node to fail.


I could see a buffer for requests filling up so much that it causes an out-of-memory crash.

But even that seems preventable.


OOM issues often don't _really_ cause a crash unless you are running with real memory only. Usually, long before you are out of total memory, you are using so much virtual memory that the kernel is thrashing IO to try to keep up with all the page faults. At this point the machine hasn't crashed, but from the outside it might as well have, because it can't respond to anything new, even diagnostic intervention, in a timely fashion.

CPU overloading, without memory starvation, can have a similar effect, though that doesn't tend to fall off the same efficiency wall as when IO is the issue (unless something has literally kicked off a fork bomb) because of the orders of magnitude differences in latency and work throughput.


A swap livelock would be worse than a crash; you really don't want a DB backend to start swapping in this kind of application.


No, but from a diagnostics PoV it can be important to know the difference between failure modes that, unless you are physically near the device so you can see drive lights flickering (or hear heads moving, if using traditional drives), could otherwise look identical.


I think the hosted SQL model may be fundamentally flawed when you are trying to do something as complex as what GitHub is doing. You are basically saying "Here, MySQL/Postgres/Oracle/et al., I trust you with 100% of my replication & persistence logic. Good luck optimizing all that SQL!".

I have started looking at tying together clusters of business apps that each have independent SQLite datastores using an application-level protocol for replication & persistence (over public HTTPS). This allows for really flexible schemes which can vary based upon the entity being transacted. With hosted SQL offerings, you are typically stuck with a fairly chunky grain for replication & transactions. If you DIY, you can control everything down to the most specific detail.

For instance, when going to persist a business object's changes, you can have a policy configured that reflects over the type and determines how many nodes need to be replicated to and which ones should replicate synchronously vs asynchronously based upon it. Log entries may only be async to 2 nodes, but you may decide that any accounting transactions should be replicated to 2 near nodes synchronously, with 2 more far nodes being async.

Additionally, you could inspect various business facts in the entity (or session related to the transaction) in order to determine ideal persistence strategy. One powerful example here could be to use a user's zip code to determine the geographically-ideal node to store the primary replica of their account & profile data. This could allow for lower latency access for that specific user.


> I have started looking at tying together clusters of business apps that each have independent SQLite datastores using an application-level protocol for replication & persistence (over public HTTPS). This allows for really flexible schemes which can vary based upon the entity being transacted. With hosted SQL offerings, you are typically stuck with a fairly chunky grain for replication & transactions. If you DIY, you can control everything down to the most specific detail.

Ugh, I'm totally going to sympathize with whoever eventually joins your company after you've left who has to own this.


I don't think I have provided enough contextual details in order for this type of conclusion to be reached.

I'd be happy to offer a hypothetical if you are interested in debating the merits of my approach.


I'm not sure it even makes sense for smaller platforms any more. For most businesses, the real metric of success is growth. Tying yourself to something that is troublesome to grow, because it favours doing things that are difficult to scale, is an evolutionary dead end. I've spent the last 25 years constantly battling this problem in relational databases. Moore's Law has been pretty kind so far, assuming you have the money to buy ridiculous computers and licenses.

I am out of touch with what is available on the NoSQL side of things, but about 5 years ago there were too many options with a terrible failure mode somewhere in the stack, so I'm never sure which way to go instead.


Boy in 2020 I can't see describing databases as difficult to scale. Expensive, sure, but these are solved problems, and the tradeoffs for going the NoSQL route will likely exceed its perceived benefits.

Modern relational databases are great at what they do.


They are until they're not, at which point it's too late. I deal with things teetering on the edge.

At some point you get to machine sizes that won't fit in AWS as an example.


That's true of, well, anything. A non-RDS / NoSQL datastore doesn't save you here!


It's more about the granularity of the architecture than it is any specific technology used within it.


Three out of four times MySQL was involved in the success of products and companies.


I don't understand the first one:

    a shared database table’s
    auto-incrementing ID column
    exceeded the size that can
    be represented by the MySQL
    Integer type
Followed by:

    GitHub’s monitoring systems
    currently alert when tables
    hit 70% of the primary key
    size
So why was there no alert?


The latter looks to be the solution put in place to deal with the former. That seems to be the pattern with all of these - first list the issue, then the last paragraph describes the remediation.


The report could certainly be clearer here.

They don't unambiguously say that the table with the primary key was the one rejecting the data, and they do say « We are now extending our test frameworks to include a linter in place for int / bigint foreign key mismatches. »

So perhaps the primary key itself was 64 bit, but a column referring to that key in a different table was mistakenly 32 bit.
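
Roughly the kind of mismatch such a linter would catch (a sketch with made-up table names, not GitHub's actual schema):

    -- Parent table: 64-bit primary key, plenty of headroom.
    CREATE TABLE repositories (
        id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
    );

    -- Child table: the referencing column was left as a 32-bit INT.
    -- Inserts referencing parent ids past 2,147,483,647 start failing
    -- even though the parent table itself is fine.
    CREATE TABLE repository_events (
        id            BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        repository_id INT NOT NULL  -- should be BIGINT UNSIGNED to match
    );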


Sounds like they were using 32-bit keys and reached the limit of roughly 4 billion items.

Had the same thing happen at my company earlier this year in our homemade chat system.

Why didn't they get alerts? Because the alert was probably counting existing rows in the table, not accounting for deleted rows that had still advanced the auto-increment counter.
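
In MySQL you can see the difference directly; something like this (hypothetical schema and table names), where the two numbers can diverge wildly on a table with lots of deletes:

    -- Live rows: goes down when rows are deleted.
    SELECT COUNT(*) FROM chat_messages;

    -- Allocated counter: only ever goes up, and is what actually overflows.
    SELECT AUTO_INCREMENT
    FROM information_schema.TABLES
    WHERE TABLE_SCHEMA = 'app' AND TABLE_NAME = 'chat_messages';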


My interpretation is that "currently" is referring to right now, after that incident. I assume they meant to suggest that the monitoring system was put in place as a result of the incident.

Or, I'm wrong. Maybe there wasn't an alert, or maybe the alert was ignored.


More importantly, why is their ID type small enough that this is even a possibility?


The "default" is a 32 bit int, large enough for almost 4.3 billion records. That is enough for the vast majority of tables, and going to a 64 bit int has some performance implications, which at GitHubs scale is absolutely something they need to take into consideration, just as they should have with the table size.


I've set up systems where I chose a 32-bit int because I could never have imagined needing more than 4 billion values. At some point I did (related to deletes), but changing the datatype when you have >1bn records can be next to impossible for 24/7 operations, especially for primary keys.
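
The naive fix is a one-liner, but on MySQL a type change like this generally forces a full table copy and blocks writes for the duration, which is why people reach for online schema-change tools (gh-ost, pt-online-schema-change) on big 24/7 tables. A sketch with a hypothetical table name:

    -- Widens the key, but rebuilds the whole table while it runs.
    ALTER TABLE events MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;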


Yes, a lot of downtime for sure, but each of these is quite an unexpected edge case. I wouldn't think many of these issues will recur, as they have added regression tests and processes for each.


I don't wish to discredit the work that's been done by github here; nor do I think it's reasonable to say they're negligent.

However, at least two of these cases are actually things that I test for as part of the practice of systems administration at large, high-throughput companies.

Transaction ID wrap-around (and auto-increment capacity) are known-knowns in database administration;

The CPU starvation and flap detection mechanisms are also known-knowns in systems administration.

The premise that these came out of left field is disingenuous at best;

unless I have "special experiences", which is possible I suppose as I was responsible for 1% of all web-traffic at one point in my life... but github should be much higher than that, I am surprised that they haven't tested the first principle assumptions, and controlled for them.

Similar to how a developer sprinkles code with asserts to ensure things that are impossible remain impossible; a sysadmin doesn't take for granted that a service will 'be magic'.


I'm pretty sure "responsible for 1% of all web-traffic" puts you in the "special experiences" category :)


I guess, then, the question is: why does nobody on the GitHub staff have similar experience to mine, when they (presumably) operate at even higher scale?

Or, if they do, why were they not in on the design meetings? Or, if they were, why were they ignored?


Yep. And several of them seem like issues that would be very difficult to reproduce outside of production (e.g. overflowing a primary key index), so it makes sense that they were not caught earlier.


It's actually an error I've seen multiple times in the past, and I'm not even working with databases full time. I do think it's surprising that no one had thought of implementing at least checks for these conditions.


Yeah, I think most people who touch or interact with backend/database code, even if they're not actually backend/DB developers, have a reaction nowadays to seeing auto-incrementing IDs, because they're famously hard to scale to any distributed architecture and introduce issues when you hit the limit.

Projects started in the last three years that I've collaborated on or been part of have all ditched auto-incrementing IDs.


Auto-incrementing integer IDs have some very desirable properties though. Not sure what you replaced them with.


Something I see more and more is a primary key based on guid/uuid;

I'm not fully aware of the merits of either approach, so I'm just putting the info I have at hand out there.


It seems that they still don't know what caused the June 29 outage, though.


A couple of observations:

It seems that MySQL is somehow connected to all the outages, especially the unexpected crashes.

>GitHub’s monitoring systems currently alert when tables hit 70% of the primary key size used. We are now extending our test frameworks to include a linter in place for int / bigint foreign key mismatches.

In 2020, should we have Integer(32-bit) primary keys anymore? I think at this time, everyone should just go with BigInt or UUID for primary keys/foreign keys, and basically not have running out of key space be an issue you have to worry about.


I always use

    UUID PRIMARY KEY DEFAULT uuid_generate_v1mc()
In Postgres. It will give you UUIDs and the larger keyspace, without being excessively random (they're basically almost sequential). I do this always, even if the table has other int columns that look like friendly values that could be used as PKs. Some time in the future they will ruin your day.
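
For completeness, a minimal sketch of what that looks like end to end, assuming the uuid-ossp extension (which is where uuid_generate_v1mc() lives) and a made-up table:

    CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

    CREATE TABLE accounts (
        id    uuid PRIMARY KEY DEFAULT uuid_generate_v1mc(),
        email text NOT NULL UNIQUE
    );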


If you want to use UUID for primary keys, the argument should not be the larger key space.

A 64 bit int is enough for a single server. If a 64 bit int PK is exhausted, the data used to store the keys alone would require 128 exabytes, or in perspective, a SAN with 6.4 million 20tb hdds. Then you need several times that to store any kind of meaningful data

If you need to use UUIDs for other reasons such as a high degree of unique keys between servers that do not coordinate, then go for it, but not for the key size.


> 128 exabytes, or in perspective, a SAN with 6.4 million 20tb hdds

Nitpick: It's actually 7.38 million disks since "20 TB" for a disk means terabyte, not tebibyte.
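
The arithmetic, for anyone who wants to check both numbers:

    2^{64} \text{ keys} \times 8 \text{ B/key} = 2^{67} \text{ B} = 128 \text{ EiB} \approx 1.476 \times 10^{20} \text{ B}
    1.476 \times 10^{20} \text{ B} \div (20 \times 10^{12} \text{ B/disk}) \approx 7.38 \times 10^{6} \text{ disks}

(The 6.4 million figure comes from treating the 128 EiB as a decimal 128 EB and dividing by 20 TB.)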


True - not that it changes the soundness of the argument.


UUIDs can have severe performance implications. Even without massive randomness, a change in the MAC address used or non sequential numbers can slow down writes massively. As others have written, a simple 64bit integer should be enough for pretty much any case. Even Google won't exhaust those for their user database.


v1mc does not depend on the MAC address.


For others curious about this uuid type:

"This function generates a version 1 UUID. This involves the MAC address of the computer and a time stamp. Note that UUIDs of this kind reveal the identity of the computer that created the identifier and the time at which it did so, which might make it unsuitable for certain security-sensitive applications."

I guess that's not immune to problems either (I can imagine problems with both the MAC address and the time).


This is not the right function. v1mc randomises the MAC. Also, it's only one of the possibilities; there are many ways of generating UUIDs depending on your situation.


Isn't UUID discouraged for a PRIMARY KEY as far as indexes are concerned?

What would the B+Tree look like for over 2 billion UUIDs with no internal order? There's also the cache-locality problem.


This is why GP wrote "It will give you UUIDs and the larger keyspace, without being excessively random (they're basically almost sequential)". The main point for not screwing up the clustered index is the "almost sequential" part.


That is for random (v4) UUIDs. v1 UUIDs are mostly sequential in time on a single machine, so they don't have this problem, although they do have the problems that others indicated already.


I could be wrong, but I thought UUIDs could cause some performance trouble for larger tables when there is an index on them?

edit: This explains it a bit

> When you insert a new row with a random primary key value, InnoDB has to find the page where the row belongs, load it in the buffer pool if it is not already there, insert the row and then, eventually, flush the page back to disk. With purely random values and large tables, all b-tree leaf pages are susceptible to receive the new row, there are no hot pages. Rows inserted out of the primary key order cause page splits causing a low filling factor. For tables much larger than the buffer pool, an insert will very likely need to read a table page from disk. The page in the buffer pool where the new row has been inserted will then be dirty. The odds the page will receive a second row before it needs to be flushed to disk are very low. Most of the time, every insert will cause two IOPs – one read and one write. The first major impact is on the rate of IOPs and it is a major limiting factor for scalability. [1]

1: https://www.percona.com/blog/2019/11/22/uuids-are-popular-bu...


OP mentioned they are using one of the sequential UUID variants so this is not an issue.


I believe they still use the MAC address? If the master node changes, the number could suddenly start at a different block, causing massive performance impacts. If they're purely sequential you might as well use a 64-bit int.


> If the master node changes, the number could suddenly start at a different block, causing massive performance impacts.

What massive performance impact? You'd have one extra IOP, total, as it loaded the page for the new mac address, and then it would carry on as normal. It would be no different from when a sequential ID crosses a page boundary.


Good on GitHub for being transparent. That said, I hope they return to being a more reliable platform soon


Everything is easy and obvious in retrospect, if you're wondering how they missed something as simple as PK size.

As part of a maturity model, every product should have periodic milestones, based on its scale, at which engineers reassess their infrastructure and the choices they made earlier. But you can always overlook things unless you have a checklist for everything.


> GitHub’s monitoring systems currently alert when tables hit 70% of the primary key size

This is interesting. I wonder if they query the key size every N seconds as part of their monitor, or if they report the key size to the monitor on write.


We have a similar setup that just queries for the auto-increment value every X period of time; it creates next to zero additional load.
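
Assuming signed 32-bit keys, the whole check can be a single query against information_schema (alerting glue omitted, threshold mirroring GitHub's 70%):

    -- Fraction of the signed INT range already allocated, worst first.
    SELECT TABLE_SCHEMA, TABLE_NAME,
           AUTO_INCREMENT / 2147483647 AS fraction_used
    FROM information_schema.TABLES
    WHERE AUTO_INCREMENT IS NOT NULL
    ORDER BY fraction_used DESC;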


Thanks.

Yeah, that seems sane. The odds of exceeding your monitoring threshold within one interval seem close to impossible too, such as going from 70% to 100% in 30 seconds or whatever your monitoring interval is set to.


I'm curious about how GitHub generates 32-bit IDs in a distributed system.


The last one is really funny to me in a slapstick way and anti-flapping sounds like a really appropriate term.


I am wondering if GitHub has ever considered switching to Postgres?


Facebook can learn something from GitHub.


External production systems depend on GitHub. For FB, it is fine to fail from time to time. Users will be even more productive :)


External production systems unfortunately depend on FB too as we've seen with all the iOS apps crashing due to issues with FB's iOS SDK.


Even more important, GitHub lives off fees paid by companies. They might switch to GitLab or other competitors if availability remains an issue. Facebook lives off ads; as long as people visit FB, companies won't really take their ads somewhere else.


In other words, so many outages are happening that we're only going to write about it once a month.


Interesting...


Who doesn’t think to use bigint for their auto incrementing IDs in Rails? This seems like it should be a non-negotiable for a company of scale such as GitHub.


I wonder whether these problems are caused by bureaucratic and incompetent developers, poor SQL engine or both.


Why so harsh? If there's one reality about software serving so many people, with so many developers, it's that there will be bugs and blind spots.


Bureaucratic and incompetent engineering management is more likely than bureaucratic and incompetent developers.


They're the same thing.


I'm the VP of Engineering from PingCAP.

As a devoted customer and a big fan of GitHub, it's devastating to see service interruptions and our business impacted by these problems. As the team behind TiDB, we (PingCAP) believe there is something we can do to help solve the database high availability and scalability problem, and we would like to propose that the database team at GitHub consider the TiDB platform. TiDB and TiKV can work as a scale-out MySQL and have been battle-tested in all kinds of hyper-scale scenarios.

Additionally, just in case you aren't doing this already, we also recommend that GitHub consider the Chaos Mesh project (https://github.com/pingcap/chaos-mesh) for chaos engineering. Chaos Mesh was our internal chaos engineering platform and we open-sourced it on Dec. 31, 2019. It can be used to simulate different kinds of failures, including network partitions, flaky disks, and node outages. You can easily use it to simulate failures in a test environment and confirm that your high availability works as expected.


If you read the published report, you'll see that only one of the issues is actually related to scalability in any way (and most newly created projects have learned the lesson of not using auto-incrementing IDs, for many reasons). The same goes for many companies: the downtime is not because of any scaling issues but rather maintenance, code deploys, or other things where humans are involved.

So while the reward of you promoting your product on HN can be high, unless it's very specific and actually solves their need, it just looks spammy.


The big problem that I took away from the availability report was the one where they lost acknowledged writes. They might not need the scaling, but they do need the ability to lose the leader without losing writes. (It would be preferable to reject the writes and become unavailable rather than to acknowledge writes only to discard them minutes later. No external API user can handle the case where "Github acknowledged my writes, and I read them back, but now they're gone". But they can handle "Github returned 'service unavailable' when I made a write".)


Thanks for your comment. Yes, only one of the issues is actually related to scalability, which is why I mentioned Chaos Mesh as well. I hope it is helpful for testing large-scale distributed systems. Neither of them can solve all the issues, but I hope they are helpful for some of them.


Can it simulate the overflow of a 32-bit auto-incrementing primary key? Should it?


Do you mean Chaos Mesh or TiDB? Chaos Mesh is not a suitable tool for a deterministic unit-test case (like inserting an overflowing value and expecting to get an error). For TiDB, it is easy to use an `ALTER TABLE` statement to enlarge the length of an integer column with zero impact on the online business.


Not sure if you're purposefully missing the question here.

ithkuil is not asking if you can change the length of the integer column, but rather whether you would somehow have prevented the issue GitHub saw here with hitting the limit. So something like automatically updating the length, or giving the developers warnings, or alerts of some sort.

My guess would be no, as I haven't seen this in the wild.

And if the answer is indeed "no", your comment seems again to plug something that wouldn't actually solve the problem GitHub saw here.


What capableweb said.

Furthermore, I wonder how long it takes to enlarge a PK column using ALTER TABLE on a database that has a significant fraction of 4 billion rows.

EDIT: I'm not arguing that GitHub would not benefit from TiDB; they might genuinely be better off with it on many fronts. I just wonder if it's fair to plug a product this way.


Thanks for the clarification. There is no automatic way to enlarge the length, and I think that is not something that should be done by the database itself. That's my own opinion. But I think it would be helpful to show a warning to the DBA when the allocated value is close to the upper limit.

As for how long it takes to enlarge a PK column using ALTER TABLE on a database with a significant fraction of 4 billion rows: the answer is seconds. To enlarge the column length, TiDB just needs to modify the metadata.
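
For what it's worth, the statement itself is just standard MySQL-compatible DDL (hypothetical table name; the metadata-only behaviour is TiDB-specific, per the above):

    ALTER TABLE issues MODIFY COLUMN id BIGINT NOT NULL;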


> There is no automatic way to enlarge the length, and I think that is not something that should be done by the database itself.

Sorry; it's hard to balance terseness with clarity. What I meant was to ask whether a tool such as Chaos Mesh could have helped the GitHub operations team be aware of such a failure mode by causing these kinds of problems early.

I think that's the purpose of chaos-engineering right? Some things happen too infrequently for us to plan for them, until when they finally do, causing problems that are more costly to solve than the cost of the prevention.

It's still unclear to me whether this particular failure mode is common enough to be part of what a chaos-engineering tool explores by default; after all, not all databases around contain billions of rows. If engineers have to explicitly enable such a test because they are afraid of what would happen if a PK overflows, well, then probably they would take other preventive measures so that the problem doesn't happen in the first place (either preventively upgrading the size of the counter, or at least put some monitoring). I think what happened here is that a solution was designed when the database was way smaller; the user base grew and grew and this particular failure mode was overlooked (because there are many many other things to care about!).

(I write this because I'm genuinely curious about what other people think about this topic, not to bash on parent for a shameless plug; I mentioned that only as a side comment; I heard good things about TiDB, I wish you well)


>>> It's still unclear to me whether this particular failure mode is common enough

It is very common. Had the same thing happen at my company last year, on multiple systems! The organization was approaching a decade old; nobody thought of that when the tables were initially created.


Yeah, I now realize it's not necessarily 4 billion rows actively used, but just 4 billion unique keys ever being allocated.


> we believe there is something we can do to help solve the database high availability and scalability problem

> There is no automatic way to enlarge the length

> I think that is not something that should be done by the database itself

> would be helpful to show a warning to the DBA

So, in the end, TiDB does nothing to actually solve the problem they experienced, but you still feel comfortable offering them your own solution? Sounds like you're a VP of Marketing, not a VP of Engineering. It leaves a very sour taste when you spam a forum like this. I hope your bosses don't read HN and see your comments plugging TiDB everywhere it's not relevant.


Well said; we will take note. We were just trying to help with anything related to MySQL's scalability.


> TiDB just needs to modify the metadata.

Thanks, that's interesting. So existing data using the previous 32-bit encoded values happily coexists with the new data being written after the migration?



