DNS Outage Post Mortem (github.com/blog)
55 points by streeter on Jan 18, 2014 | 14 comments


These are the kinds of corner cases that get put on the back burner en route to delivering an MVP. Just as you don't prematurely optimize your infrastructure, you almost never enumerate all the issues that can crop up in a less-than-ideal situation.

This isn't a GitHub-only issue but one that would affect most quick-to-launch startups. What I'm learning from this is that one needs to regularly revisit the infrastructure and how it's glued together by the provisioning system.

If it's not broken, break it.


+1. This is invaluable, and it fits into a larger frame of thinking -- "immutable infrastructure." Schedule time to regularly provision your entire stack from the ground up, without any of the caching optimizations, and run it in production.

With tools like Chef and AWS CloudFormation, there's really no excuse not to.
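
To make that concrete, here is a minimal sketch of the rebuild-from-scratch exercise in Python with boto3, assuming an AWS account and a placeholder template file and stack name ("infra.yaml", "app-stack"); the point is that everything should come back from the template alone, with no hand-applied state:

    # Sketch: rehearse rebuilding a stack from nothing with boto3 + CloudFormation.
    # "infra.yaml" and "app-stack" are placeholder names, not GitHub's setup.
    import boto3

    cfn = boto3.client("cloudformation")

    with open("infra.yaml") as f:
        template = f.read()

    # Tear the stack down, then bring it back from the template alone,
    # so drift and hand-applied fixes surface immediately.
    cfn.delete_stack(StackName="app-stack")
    cfn.get_waiter("stack_delete_complete").wait(StackName="app-stack")

    cfn.create_stack(
        StackName="app-stack",
        TemplateBody=template,
        Capabilities=["CAPABILITY_IAM"],
    )
    cfn.get_waiter("stack_create_complete").wait(StackName="app-stack")
    print("stack rebuilt from the template alone")

Run it against a staging copy first, on a schedule, and the provisioning system stays honest.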


"an initial verification led us to believe the changes had been rolled out successfully"

I would love more detail on the type II error in this validation step; it's worth exploring more deeply. What was the verification step? Why did it not detect the issue? What review process was applied to the verification step?

While the failed verification step is not the root cause, having good safety checks is the most important part of planning good changes, whether they're DNS reconfigurations, network changes, or software deployments.
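
One hedged way to make such a verification step concrete is to query every authoritative nameserver directly, bypassing caches, and compare the answers. A sketch with dnspython; the nameserver IPs, record name, and expected answer are placeholders:

    # Sketch: check a DNS change on each authoritative server directly,
    # bypassing caches. Uses dnspython; IPs and expected answer are placeholders.
    import dns.resolver

    AUTHORITATIVE = ["192.0.2.1", "192.0.2.2"]   # placeholder nameserver addresses
    EXPECTED = {"203.0.113.10"}                  # placeholder expected A records

    def verify(name):
        ok = True
        for ns in AUTHORITATIVE:
            resolver = dns.resolver.Resolver(configure=False)
            resolver.nameservers = [ns]
            answer = resolver.resolve(name, "A")
            got = {rr.address for rr in answer}
            if got != EXPECTED:
                print(f"{ns}: got {got}, expected {EXPECTED}")
                ok = False
        return ok

    if __name__ == "__main__":
        print("verified" if verify("example.com") else "mismatch -- do not proceed")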


I was surprised that the total time between rolling out to the first set of servers, waiting, verifying, and then rolling out to the second set was a whopping nine minutes.

Maybe I'm just too careful (perhaps because I've seen it happen before) but I prefer to wait a helluva lot longer than that for verification.

Perhaps it's because I dealt with Microsoft Active Directory so much in the past, but I am extremely careful when it comes to DNS. If there's one thing that'll screw up your entire environment (especially in an AD-based network), it's broken DNS.
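
A staged rollout with a real bake period between batches is one way to buy that extra time. A hypothetical sketch in Python; apply_change(), healthy(), the host names, and the 30-minute bake are stand-ins for whatever the real change and checks would be:

    # Sketch: staged rollout with a long bake period between batches.
    # apply_change() and healthy() are hypothetical stand-ins for the real work.
    import time

    BATCHES = [["dns01", "dns02"], ["dns03", "dns04"]]  # placeholder hosts
    BAKE_SECONDS = 30 * 60                              # far longer than nine minutes

    def apply_change(host):
        print(f"applying change to {host}")             # real change goes here

    def healthy(host):
        return True                                     # real verification goes here

    for batch in BATCHES:
        for host in batch:
            apply_change(host)
        time.sleep(BAKE_SECONDS)                        # let slow failures surface
        if not all(healthy(h) for h in batch):
            raise SystemExit(f"rollout halted after batch {batch}")
    print("rollout complete")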


This is why I like Chef: there are good tools out there to test your code (FoodCritic, ChefSpec, Test Kitchen) before rolling to production and having to validate machines in production. Ouch.
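
In the same spirit, though sketched in Python with dnspython rather than Chef's Ruby tooling, a pre-deploy check can refuse to ship a zone file that lost records it must keep; the file name and record names below are placeholders:

    # Sketch: refuse to ship a zone file that lost records it must keep.
    # Python/dnspython stands in for Chef's Ruby tooling; names are placeholders.
    import dns.zone

    zone = dns.zone.from_file("db.example.com", origin="example.com")

    for name in ("@", "www"):
        rdset = zone.get_rdataset(name, "A")
        assert rdset is not None, f"{name} has no A record -- aborting deploy"

    print("zone file looks sane")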


Who in their right mind would schedule a critical infrastructure upgrade during the day?


That's a very tough call to make. Making changes at off-peak times makes some sense, especially if you think there's a likelihood of disruption. But if you're planning for the unexpected, it can be best to make changes when people are at their most alert and when plenty of help is around. Also, in a fast-moving, quickly growing business, there's only so often you can come in at very anti-social hours before burning out.


It is always daytime somewhere (see this visualized specifically for GitHub: http://aasen.in/github_globe/). Best to do the work when YOUR A-Team is available, awake, and alert.

I doubt there is a time when they wouldn't have disrupted a significant part of their userbase. Even if you assume a specific place has the majority of users (San Francisco, Germany, whatever), developers tend to work odd hours anyway.


First, deploying during the day is good because if something goes wrong, the entire team is physically present to deal with it. Second, GitHub is used worldwide, which means 24 hours a day, so there isn't a "night" to deploy during.


What is the difference between day and night when your users are worldwide?


There is significant variance in population per time zone, and even more significant variance in internet usage per time zone. Some of this variance is demographic, but most of it is geographic. An interesting and convenient thing about the present layout of the world is that the Pacific Ocean takes up almost half of it, and almost half of the world's land masses are uninhabitable tundra and desert (though that's not so relevant to time peaks).

This has the great effect of lowering the median travel times and information-transmission latencies between the world's population centers, and it means that, for at least this geological epoch, we're always going to have daily global peak and off-peak times for human-driven activity.


A reasonable approximation is to model a sine wave per region, with peak amplitude based on typical usage patterns. For example, entertainment services peak in the evenings and at weekends, while business services peak during core hours, Monday to Friday. With enough users spread across enough regions, there is never a good time for all users, so in practice it is better to engineer things so you can do deploys and maintenance whenever is best for the teams working on the service.
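
A toy version of that model, summing one wave per region; the regions, relative amplitudes, and peak hours are made-up numbers, only meant to show that the aggregate load never drops to zero:

    # Sketch: toy global-load model, one clamped cosine per region.
    # Regions, relative amplitudes, and peak hours are made-up numbers.
    import math

    REGIONS = [
        # (name, relative amplitude, local peak hour, UTC offset)
        ("Americas", 1.0, 14, -6),
        ("Europe",   0.9, 14,  1),
        ("Asia",     0.8, 14,  8),
    ]

    def load_at(utc_hour):
        total = 0.0
        for _name, amplitude, peak_local, offset in REGIONS:
            local = (utc_hour + offset) % 24
            phase = 2 * math.pi * (local - peak_local) / 24
            total += amplitude * max(0.0, math.cos(phase))  # clamp overnight to zero
        return total

    for h in range(24):
        print(f"{h:02d}:00 UTC  load={load_at(h):.2f}")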


13:20 PST = 16:20 EST
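
For reference, the same conversion sketched with Python's zoneinfo (3.9+); the date below is arbitrary and only ensures standard time (PST/EST) applies:

    # Sketch: the same conversion with zoneinfo (Python 3.9+).
    # The date is arbitrary, chosen only so standard time (PST/EST) applies.
    from datetime import datetime
    from zoneinfo import ZoneInfo

    t = datetime(2014, 1, 18, 13, 20, tzinfo=ZoneInfo("America/Los_Angeles"))
    print(t.astimezone(ZoneInfo("America/New_York")).strftime("%H:%M %Z"))  # 16:20 EST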


What a seriously dumb outage to have. I'm still confused about it after reading the RFO.



