DNS Outage Post Mortem (github.com/blog)
55 points by streeter on Jan 18, 2014 | 14 comments


These are the kinds of corner cases that get put on the back burner en route to delivering an MVP. Just as you don't prematurely optimize your infrastructure, you almost never enumerate all the issues that can crop up in a less-than-ideal situation.

This isn't a GitHub-only issue but one that would affect most quick-to-launch startups. What I'm learning from this is that one needs to regularly revisit the infrastructure and how it's glued together by the provisioning system.

If it's not broken, break it.


+1. This is invaluable, and it fits into a larger frame of thinking -- "immutable infrastructure." Schedule time to regularly provision your entire stack from the ground up, without any of the caching optimizations, and run it in production.

With tools like Chef and AWS CloudFormation, there's really no excuse not to.
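
To make that concrete, here is a minimal sketch of the rebuild-from-scratch exercise in Python with boto3, assuming an AWS account and a placeholder template file and stack name ("infra.yaml", "app-stack"); the point is that everything should come back from the template alone, with no hand-applied state:

    # Sketch: rehearse rebuilding a stack from nothing with boto3 + CloudFormation.
    # "infra.yaml" and "app-stack" are placeholder names, not GitHub's setup.
    import boto3

    cfn = boto3.client("cloudformation")

    with open("infra.yaml") as f:
        template = f.read()

    # Tear the stack down, then bring it back from the template alone,
    # so drift and hand-applied fixes surface immediately.
    cfn.delete_stack(StackName="app-stack")
    cfn.get_waiter("stack_delete_complete").wait(StackName="app-stack")

    cfn.create_stack(
        StackName="app-stack",
        TemplateBody=template,
        Capabilities=["CAPABILITY_IAM"],
    )
    cfn.get_waiter("stack_create_complete").wait(StackName="app-stack")
    print("stack rebuilt from the template alone")

Run it against a staging copy first, on a schedule, and the provisioning system stays honest.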


"an initial verification led us to believe the changes had been rolled out successfully"

I would love more detail on the type II error in this validation step; it's worth exploring more deeply. What was the verification step? Why did it not detect the issue? What review process was applied to the verification step?

While the failed verification step is not the root cause, having good safety checks is the most important part of planning good changes, whether they're DNS reconfigurations, network changes, or software deployments.
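
One hedged way to make such a verification step concrete is to query every authoritative nameserver directly, bypassing caches, and compare the answers. A sketch with dnspython; the nameserver IPs, record name, and expected answer are placeholders:

    # Sketch: check a DNS change on each authoritative server directly,
    # bypassing caches. Uses dnspython; IPs and expected answer are placeholders.
    import dns.resolver

    AUTHORITATIVE = ["192.0.2.1", "192.0.2.2"]   # placeholder nameserver addresses
    EXPECTED = {"203.0.113.10"}                  # placeholder expected A records

    def verify(name):
        ok = True
        for ns in AUTHORITATIVE:
            resolver = dns.resolver.Resolver(configure=False)
            resolver.nameservers = [ns]
            answer = resolver.resolve(name, "A")
            got = {rr.address for rr in answer}
            if got != EXPECTED:
                print(f"{ns}: got {got}, expected {EXPECTED}")
                ok = False
        return ok

    if __name__ == "__main__":
        print("verified" if verify("example.com") else "mismatch -- do not proceed")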


I was surprised that the total time between rolling out to the first set of servers, waiting, verifying, and then rolling out to the second set was a whopping nine minutes.

Maybe I'm just too careful (perhaps because I've seen it happen before) but I prefer to wait a helluva lot longer than that for verification.

Perhaps it's because I dealt with Microsoft Active Directory so much in the past, but I am extremely careful when it comes to DNS. If there's one thing that'll screw up your entire environment (especially in an AD-based network), it's broken DNS.
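
A staged rollout with a real bake period between batches is one way to buy that extra time. A hypothetical sketch in Python; apply_change(), healthy(), the host names, and the 30-minute bake are stand-ins for whatever the real change and checks would be:

    # Sketch: staged rollout with a long bake period between batches.
    # apply_change() and healthy() are hypothetical stand-ins for the real work.
    import time

    BATCHES = [["dns01", "dns02"], ["dns03", "dns04"]]  # placeholder hosts
    BAKE_SECONDS = 30 * 60                              # far longer than nine minutes

    def apply_change(host):
        print(f"applying change to {host}")             # real change goes here

    def healthy(host):
        return True                                     # real verification goes here

    for batch in BATCHES:
        for host in batch:
            apply_change(host)
        time.sleep(BAKE_SECONDS)                        # let slow failures surface
        if not all(healthy(h) for h in batch):
            raise SystemExit(f"rollout halted after batch {batch}")
    print("rollout complete")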


This is why I like Chef: there are good tools out there to test your code (FoodCritic, ChefSpec, Test Kitchen) before rolling to production and having to validate machines in production. Ouch.
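
In the same spirit, though sketched in Python with dnspython rather than Chef's Ruby tooling, a pre-deploy check can refuse to ship a zone file that lost records it must keep; the file name and record names below are placeholders:

    # Sketch: refuse to ship a zone file that lost records it must keep.
    # Python/dnspython stands in for Chef's Ruby tooling; names are placeholders.
    import dns.zone

    zone = dns.zone.from_file("db.example.com", origin="example.com")

    for name in ("@", "www"):
        rdset = zone.get_rdataset(name, "A")
        assert rdset is not None, f"{name} has no A record -- aborting deploy"

    print("zone file looks sane")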


Who in their right mind would schedule a critical infrastructure upgrade during the day?


That's a very tough call to make. Making changes at off-peak times makes some sense, especially if you think there's a likelihood of disruption. But if you're planning for the unexpected, it can be best to make changes when people are at their most alert and when plenty of help is around. Also, in a fast-moving, quickly growing business, there's only so often you can come in at very anti-social hours before burning out.


It is always daytime somewhere (see this visualized specifically for GitHub: http://aasen.in/github_globe/). Best to do the work when YOUR A-Team is available, awake, and alert.

I doubt there is a time when they wouldn't have disrupted a significant part of their userbase. Even if you assume a specific place has the majority of users (San Francisco, Germany, whatever), developers tend to work odd hours anyway.


First, deploying during the day is good because if something goes wrong, the entire team is physically present to deal with it. Second, GitHub is used worldwide, which means 24 hours a day, so there isn't a "night" to deploy during.


What is the difference between day and night when your users are worldwide?


There is significant variance in population per time zone, and even more significant variance in internet usage per time zone. Some of this variance is demographic, but most of it is geographic. An interesting and convenient thing about the present layout of the world is that the Pacific Ocean takes up almost half of it, and almost half of the world's land masses are uninhabitable tundra and desert (though that's not so relevant to time peaks).

This has the great effect of lowering the median travel times and information-transmission latencies between the world's population centers, and it means that, for at least this geological epoch, we're always going to have daily global peak and off-peak times for human-driven activity.


A reasonable approximation is to model a sine wave per region, with peak amplitude based on typical usage patterns. For example, entertainment services peak in the evenings and at weekends, while business services peak during core hours, Monday to Friday. With enough users spread across enough regions, there is never a good time for all users, so in practice it is better to engineer things so you can do deploys and maintenance whenever is best for the teams working on the service.
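
A toy version of that model, summing one wave per region; the regions, relative amplitudes, and peak hours are made-up numbers, only meant to show that the aggregate load never drops to zero:

    # Sketch: toy global-load model, one clamped cosine per region.
    # Regions, relative amplitudes, and peak hours are made-up numbers.
    import math

    REGIONS = [
        # (name, relative amplitude, local peak hour, UTC offset)
        ("Americas", 1.0, 14, -6),
        ("Europe",   0.9, 14,  1),
        ("Asia",     0.8, 14,  8),
    ]

    def load_at(utc_hour):
        total = 0.0
        for _name, amplitude, peak_local, offset in REGIONS:
            local = (utc_hour + offset) % 24
            phase = 2 * math.pi * (local - peak_local) / 24
            total += amplitude * max(0.0, math.cos(phase))  # clamp overnight to zero
        return total

    for h in range(24):
        print(f"{h:02d}:00 UTC  load={load_at(h):.2f}")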


13:20 PST = 16:20 EST
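
For reference, the same conversion sketched with Python's zoneinfo (3.9+); the date below is arbitrary and only ensures standard time (PST/EST) applies:

    # Sketch: the same conversion with zoneinfo (Python 3.9+).
    # The date is arbitrary, chosen only so standard time (PST/EST) applies.
    from datetime import datetime
    from zoneinfo import ZoneInfo

    t = datetime(2014, 1, 18, 13, 20, tzinfo=ZoneInfo("America/Los_Angeles"))
    print(t.astimezone(ZoneInfo("America/New_York")).strftime("%H:%M %Z"))  # 16:20 EST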


What a seriously dumb outage to have. I'm still confused about it after reading the RFO.



