
> This response indicates lack of experience to me.

I could say the same in return. I've worked in a few data centres and they've all put redundancy at the forefront of their design (even to the point of having multiple physical fibre paths for failover).

> No systems are immune to failure. No matter how much redundancy you have

You have that backwards. No systems are immune to failure, which is why you have redundancy.

> chances are you have interdependencies you did not anticipate, and sooner or later run into failure scenarios that violate your expectations.

If the dependencies haven't been anticipated then someone isn't doing their job right. There's a reason why incident response / disaster recovery / business continuity plans are written. It should be someone's job to think up every "what if" scenario, ranging from each and every bit of kit dying, to all your staff winning the lottery and walking out the next day, to terrorist attacks. I've even had to account for what would happen if nukes were dropped on the city where our main data centre was housed (though the answer to that was a simple one: nobody would care that our site went offline). It might sound clichéd, but people get paid to expect the unexpected and work out how to maintain business continuity.

> It's very well possible that Cloudflare messed up here, but to claim so categorically that "human error is unacceptable" is a bit of a joke. We build systems to withstand the risks we know about, and guess at some we don't.

This was their infrastructure failing. If you own and maintain the infrastructure then you have no excuse not to work out what might happen if each and every part of it failed (trust me, I've had to do this in my last two jobs, despite your accusations of my "lack of experience" ;) ).

> But the number of possible failure scenarios we don't understand properly is pretty much infinite.

You're confusing cause and effect. The number of different causes of failure is practically infinite, but the number of effects is finite. For example, a server could crash for any number of reasons (hardware, software, user error, and all the different ways within those categories), but the end result is the same: the server has crashed. So what you do is plan for the situations where different services fail (staff don't turn up for work, your domain name services stop responding, etc) and build some kind of redundancy around each of those, giving engineers a little more breathing time to fix the issue with the minimum possible disruption to your users. Given that Cloudflare had to resort to Twitter to update their users, they completely failed every possible aspect of such planning. And given the high profile sites that depend on Cloudflare, they have no excuses.
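To put that in concrete terms, here's a rough sketch of what "plan for the effect, not the cause" looks like at the client level (hypothetical endpoints, obviously nothing to do with Cloudflare's actual setup): whatever takes the primary down, the caller only ever sees "no response" and falls over to the secondary.

    import urllib.request
    import urllib.error

    # Hypothetical endpoints; the point is that we don't care *why*
    # the primary is unreachable, only that the effect (no response)
    # has been planned for.
    ENDPOINTS = [
        "https://primary.example.com/status",
        "https://secondary.example.com/status",
    ]

    def fetch_status(timeout=2):
        """Try each endpoint in order; return the first successful response."""
        last_error = None
        for url in ENDPOINTS:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    return resp.read()
            except (urllib.error.URLError, OSError) as exc:
                last_error = exc  # hardware, software or user error: same effect
                continue
        raise RuntimeError("all endpoints failed: %s" % last_error)

The same idea scales up to DNS, routing and status pages: you enumerate the failed-service scenarios, not the infinite list of root causes.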

If this had happened at any of the other companies I've worked for, I'd genuinely be fearful for my job, as a crash of that magnitude would mean that I hadn't done my job properly.


