> At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
It remains amazing to me that even with all the layers of automation, the root cause of most serious deployment problems remains some variant of a fat-fingered user.
Look at the language used though. This is saying very loudly "Look, this isn't the engineer's fault here". It's one thing I miss about Amazon's culture: not blaming people when systems fail.
The follow-up doesn't bullshit with "extra training to make sure no one does this again", it says (effectively) "we're going to make this impossible to happen again, even if someone makes a mistake".
Any time I see "we're going to train everyone better" or "we're going to fire the guy who did it", all I can read is "this will happen again". You can't actually solve the problem of user error with training, and it's good to see Amazon not playing that game.
What bothered me about running TrueTech is that customers would sometimes demand repercussions against employees for making mistakes.
Enter Frans Plugge. Whenever a customer would get into that mode we'd fire Frans. This was easy, simply because he didn't exist in the first place (his name was pulled from a skit by two Dutch comedians, bonus points if you know who and which skit).
This usually caused the customer to backtrack, insisting he/she never meant for anybody to get fired...
It was a funny solution and we got away with it for years, for one because it was pretty rare to get customers that mad to begin with and for another because Frans never wrote any blog posts about it ;)
But I was always waiting for that call from the labor board asking why we fired someone for whom there was no record of employment.
Is it unreasonable for me to think that company owners should have the spine to say, "We take the decision to fire someone very seriously. We'll take your comments under consideration, but we retain sole discretion over such decisions"?
It irks me that businesses fire people because of pressure from clients or social media. But having never been the boss, I may be missing something.
One reason to like a facet of Japanese management culture: if a customer wants someone raked over the coals, you offer up management, not employees.
Internal repercussions notwithstanding, externally the company is a united front. It cannot excuse mistakes by luck, accident, or happenstance, because the world includes luck, accidents, and happenstance, so any user-visible error is ipso facto a failure of management.
Apparently this is (or was) a job in Japan: companies would hire what amounts to an actor to get screamed at by the angry customer and pretend to get fired on the spot. Rinse, repeat whenever such appeasement is required.
I know one person who does this for real estate developers. He gets involved in contentious projects early on, goes to community meetings, offers testimony before the city council, etc. When construction gets going and people inevitably get pissed about some aspect of the project, he gets publicly fired to deflect the blame while the project moves on. Have seen it happen on three different projects in two cities now and, somehow, nobody catches on.
I don't know how to describe this in a single word or phrase, but I think the situation itself is a "genius problem." Not a genius solution. The problem itself is impressive and rich in layers of human nature, local culture, etc. - but once you have such a problem, any average person could come up with a similar solution, because it is obvious.
It's still mind blowing and very amusing that this is a thing in our world!
Do you have a citation for that? I am curious; it's something I've never heard of and goes against my intuitions/experience regarding what traditionally managed Japanese companies would do. (Entirely possible it has happened! Hence the cite request.)
Imagine if the customer saw the same actor getting fired in different companies! Is the customer going to catch on? More likely, they will think "Yeah, no wonder there was a problem. This same incompetent dude wormed his way into this company too" :-)
There's a movie sketch in here somewhere. A guy has the worst day of his life, every single thing goes wrong, and at every single company the same person is "responsible" for the issue.
Historically, some cultures practiced mock firing as a way to appease an angry customer. This was back in the day when most business transactions occurred face to face, so the owner would demand that the employee pack their belongings and leave the premises in full view of the customer. Of course this was all for show, but this kind of public humiliation seemed to satisfy even the most difficult customers.
Even when the customer knew it was for show, I see it as a way of saying, "yes, we acknowledge that we screwed up and make a public, highly visible note of it that will be recorded in the annals of peoples' gossip in this area".
To what lengths did you keep that up? Did you just tell the client informally, formally in a meeting, or did you actually put it in writing or even fake some "firing" paperwork?
This is true in some cases, but not when mitigations aren't practiced properly - it's not the fat-fingered user who should be fired or retrained, but the designer or maintainer of the system that allowed a slip to become a serious issue.
Look at the recent GitLab incident - one guy messed up and nuked a server. Okay, that happens sometimes, go to backups. Uh oh, all the backups are broken. Minor momentary problem just turned into a major multi-day one.
That's a problem and one which could be preventable with training (or, arguably, firing and hiring). Maintaining your backups properly should be someone's duty, designing and testing systems to minimize impact of user error should be too.
Fair enough. I guess what I meant was specifically using training or punishment to combat "momentary lapse" issues.
If someone doesn't test their backups, you train them to test backups. If someone lies about testing the backups, maybe you fire them. But if someone trips and shatters the only backup disk, you don't yell at them - you create backups that an instant of clumsiness can't ruin.
I did overstate, training is perfectly reasonable, but I often see it cited exactly when it shouldn't be, as a solution to errors like typos or forgetfulness.
Training people to test backups is still a bad idea: why make someone do a job that is purely verification? Those jobs eventually stop getting done, and it's hard to keep people doing them.
Instead, you make a machine verify the backups simply by using the backups all the time. For example, at work I feed part of our data pipeline with backups: Those processes have no access to the live data. If the backups break, those processes would provide bad information to the users, and people would come complaining in a matter of minutes.
Just like when you have a set of backup servers, you don't leave them collecting dust, or tell someone to go look at them every once in a while: you just route 1% of the traffic through them. They are still extra capacity, you can still do all kinds of things to them without too much trouble, but you know they are always working.
Never, ever, force people to do things they don't gain anything from. Their discipline would fade, just like it fades when you force them to a project management tool they get no value from.
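A minimal sketch of the "route a small share of traffic through the standby/backup pool so it is always being exercised" idea above, in Python; the pool names and the 1% share are placeholders:

```python
import random
from collections import Counter

# Hypothetical pools: names and weights are illustrative only.
PRIMARY_POOL = ["app-01", "app-02", "app-03"]
STANDBY_POOL = ["standby-01", "standby-02"]  # spare / restored-from-backup capacity

STANDBY_SHARE = 0.01  # ~1% of requests exercise the standby pool continuously


def pick_server() -> str:
    """Route most traffic to the primary pool, but keep the standbys warm.

    If the standby pool (or the backups feeding it) is broken, that 1% of
    requests starts failing within minutes and someone notices, instead of
    the breakage being discovered during a disaster.
    """
    pool = STANDBY_POOL if random.random() < STANDBY_SHARE else PRIMARY_POOL
    return random.choice(pool)


if __name__ == "__main__":
    print(Counter(pick_server() for _ in range(100_000)))
```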
No. You don't make a daily task of testing backups. That would be wrong for precisely the reasons you cite. It's a waste of effort and time, and ignores what the point of testing them is for: ensuring that the procedure still works.
One would only actually test the backups about twice a year just to be damn sure they are still resulting in restorable data. The rest of the year it's only worth keeping an automated process reporting whether or not the things are being made, and people keeping an eye on change management to be sure no changes are made to the known-to-be working process that can break it without the new process incurring an explicit vetting cycle. Gitlab wasn't apparently testing or engaging in monitoring what was supposed to be an automated process. That's where they got burned.
Process monitoring may be boring as hell, but it's seldom wasted effort, and will prevent massive, compounded headaches from bringing operations to a chaotic halt.
I agree. I have the same policy when setting up servers: don't have a "primary" and a "backup" server, make both servers production servers and have the code that uses them alternate between them, pick a random one, whatever. (I don't always get to implement this policy, of course.)
This makes some sense, but I don't think it negates testing backups? For duplicate live data, yeah, you can't just use both. But most businesses have at least some things backed up to cold storage, and that still needs to be popped in a tape deck (or whatever's relevant) and verified.
I don't buy the "if we just plan ENOUGH, disasters will never occur" argument. The universe is just too darn interesting for us to be able to plan enough to prevent it from being interesting.
It is all about hazard/risk modeling and mitigation. E.g.
Someone rm -rf / ing the server will happen eventually with near 100% certainty in any company and can be mitigated by tested, regular, multiply redundant backups.
Cosmic rays flipping bits will happen with near 100% probability at the scale someone like Amazon works at, and can be mitigated by redundant copies and filesystems with checksum-style checks. Similar with hard drive failure.
Earthquakes will happen in some areas with near certainty over the time periods companies like Amazon presumably hope to be in business and could be mitigated by having multiple datacenters and well constructed buildings. Similar for 'normal' scale volcanoes.
Fires will happen but they can be mitigated (with appropriate buildings and redundancy).
Small meteorite strikes are unlikely but can be mitigated by redundancy.
Solar activity causing an electromagnetic storm - yeah, one can shield one's datacenter in a Faraday cage, but in this situation the whole world is probably in chaos and one's datacenter will be the least of one's concerns (unless shielding becomes standard, in which case you'd better be doing it). Similar applies for nuclear war, super volcanoes, massive meteorite strikes, or other global events at the interesting end of the scale.
But yeah, there are going to be things that get missed. The key is having an organization that (1) learns from its mistakes and (2) learns from others' mistakes, and continually keeps its risk modeling and mitigation measures up to date. And note that many of the hazards that are worth mitigating share the same mitigation, i.e. redundancy (at different scales).
Giving people training in response to things like this always seemed a little strange to me - that particular person just got the most effective training the world has ever seen. If you look at it that way, you could say that everything Amazon spent responding to this was actually a training expense for this particular person and team. After you've already done that, it seems silly to make them sit through some online quiz or PowerPoint by a supposed guru and think you're accomplishing anything.
Yeah indeed. You know who the one person at Amazon is that I'd expect to never fat finger a sensitive command ever ever again? The guy who managed to fat finger S3 on Tuesday. Firing him over this mistake is worse than pointless, it offers absolution to every other developer and system that helped cause this event.
I wouldn't merely call this a fat finger or typo - it's quite possible that the tool itself was so error-prone to use that mistakes were impossible to avoid, given the complexity of its inputs.
Based on Amazon's decision to improve the tooling such that this category of error would be (hopefully) impossible to reproduce, I would lean more towards that being the case.
I think the value of making these mistakes is in learning from them and then making sure they can't happen again. Leaving this process in place and just making this guy run the command forever because he screwed it up once would be a much less effective solution than fixing the tooling so it's impossible to do this in the first place. Telling this one guy "don't do it again" also offers absolution to everyone else on the team. In a healthy culture, only "we" can fail.
Is it not true for you? I know that I'm personally good at avoiding the same mistake. I'm also extraordinarily good at avoiding repeating catastrophic mistakes. I generally change my processes in the same way that Amazon is changing their processes to avoid this mistake.
I am not talking about what Amazon is doing, but about the notion that the individual won't make the same mistake again, which is what the grandparent is getting at.
He won't make the same mistake because no one makes the same big mistake twice? I wouldn't bank on that alone.
Years ago I read a story about a fat-fingered ops person getting called into the CEO's office after an outage. "I thought you were calling me in to fire me." "I can't afford to fire you, today I spent a million dollars training you."
Which is really the point of automation and configuration management. When a manager asks you, "How are you going to prevent this in the future?" You can say, "We added a check so n must be less than x% of the total number of cluster members," or "We added additional unit tests for the missing area of coverage" or "We added new integration tests that will pick up on this."
Tests and configuration scripts don't prevent all breakage. But when you have them, you can say, "We missed that, let's add it," or "That failed, but it's a false positive. Let's add this edge case to this test."
If you have no automation, tests or auditing systems around running deployments, you can't do any of this.
I agree testing and automation are good. I think they need to go beyond this to formal verification, for something on this scale and reliability. NASA doesn't make these sorts of mistakes.
By the way - this is not just Amazon's problem now. We know the internet has a single point of failure. So does a lot of IoT.
(Specifically https://www.youtube.com/watch?v=6OalIW1yL-k#t=3m but it's worth watching the whole clip (or even the whole movie) if you haven't seen it before. It's from Terry Gilliam's "Brazil".)
>We know the internet has a single point of failure.
It has? I have yet to see the day where I can neither reach my email provider nor Google nor Hacker News. My local provider might screw up occasionally, or some number of websites go unreachable for whatever reason. But I fail to come up with anything short of cutting multiple sea cables that causes more than 50% of servers to be unreachable to more than 50% of users.
Amazon do formally verify AWS (they use TLA+), which is probably why this failure is a human error. Of course, you could expand the formal analysis of the system to include all possible operator interactions, but you'll need to draw the line at some point. NASA certainly makes human errors that result in catastrophic failures. The Challenger disaster was also a result of human error to a large degree[1]; to quote Wikipedia: "The Rogers Commission found NASA's organizational culture and decision-making processes had been key contributing factors to the accident, with the agency violating its own safety rules."
This is the major basis of the CMM Levels [1]. At higher levels of maturity and necessity, systems and processes are designed to increasingly prevent errors from reaching a production environment.
Amazon is taking the right approach here. The fact that a system as complex and important as S3 can be taken down is a failure of the system, not the person who took it down accidentally.
A lot of the IT vendors I have worked with were CMM/CMMI level 5, but the crappiness of their development, process, and deployment work makes me wonder whether all their effort goes into attaining those certifications as opposed to doing something better.
As someone who worked for an IT vendor with certification and as someone who was part of the certification team at another place, I can assure you that you're right.
The certification is more for the organization/unit, and the people doing the work don't realize what it is for. Another thing that usually becomes a problem is the rigidity of the certification. Saying you need X, Y and Z documented is easy, but it doesn't work for projects that maybe don't have Y. So people make up documentation and process just to be compliant, and this soon becomes a hindrance to the work.
At this point people either abandon the process or follow it and the work suffers.
Thank you for adding this comment. I am glad there are more people out there that aren't afraid to be honest about some of the nonsense 'follow the process no matter what' stuff that I have experienced over the years.
CMM level 5 ==> You have a well-documented, repeatable, and still horrible process that declares all errors statistically uncommon by "augmenting" the root cause with random factors. Insta-certification.
I've had the privilege of either working for myself, the company that acquired mine and let me run the dev, or at Google. From that perspective, and what I understand about ops, the rarity is not having the attitude mentioned in the parent.
This is good. And for the software engineers, great.
I've heard from people doing the grunt work at Amazon -- warehouse staff -- that Amazon incentivises employees to rat out each other for mishandling, lateness, etc., fostering intense competition.
I spent time in the fulfillment centers, writing software for them. I definitely didn't see that sort of thing. There's no need - the software tracked everything they did. Low performers would be found and retrained or "promoted to customer" without the need for anyone to "rat out".
Plus, managing humans in a 'rat out' system would be incredibly inefficient. Now you need lots of employees just to listen to the ratting!
Agreed, especially regarding the culture but isn't this pretty much the same explanation they gave a few years ago when something similar happened?
I seem to recall an EC2 or S3 outage a few years ago that boiled down to an engineer pushing out a patch that broke an entire region when it was supposed to be a phased deployment.
I could be mis-remembering that but it's important that these lessons be applied across the whole company (at least AWS) so it would be a bigger mark against AWS if this is a result of similar tooling to what caused a previous outage.
I believe Jeff once said something along the lines of "why would I fire an employee that made an honest mistake? I just spent a bunch of money teaching him a lesson"
The linked article also says the tools they use were changed to limit the amount of resources that could be taken down at a single time, the speed they could be taken down at, and a hard floor was put on the number of instances that could be stopped.
That's a lot more than just extra training, and a lot better than a two-key system.
> Those in the U.S. that had been fitted with the devices, such as the ones in the Minuteman silos, were installed under the close scrutiny of Robert McNamara, JFK's Secretary of Defense. However, the Strategic Air Command greatly resented McNamara's presence, and almost as soon as he left, the code to launch the missiles, all 50 of them, was set to 00000000.
> Oh, and in case you actually did forget the code, it was handily written down on a checklist handed out to the soldiers.
I think you have it backwards. The post does not say they will simply be training the problem away. They are putting safeguards into their tooling to prevent the case of a fat finger.
The article leaves little doubt that they didn't know such an event would be so hard to recover from. They knew it wouldn't be easy, but they were surprised by how bad it was.
I've long said something like "To err is human. To fuck up a million times in a second you need a computer."
I may have to upgrade that to take the mighty power of Cloud (TM) into account, though. Billions and trillions of fuck ups per second are now well within reach!
Yes, it is. I believe I added the concept of "fuckups per second", but my memory being what it is and the general creativity of the internet being what it is, I would not be surprised that it either wasn't original or I wasn't the first.
We may think that an automated system requires less understanding in order to operate it. But from the other point of view, you have to know what you are doing; the consequences of even a small change are big.
This is one of the things that happens with Windows: standing up a server is so easy that people believe they don't have to understand what's under the hood, and then we get a lot of misconfiguration and operational issues.
It's one of the reasons that silly guarantees like "eleven 9s of reliability" are meaningless. There are humans here. "Accidental human mishap" is gonna happen sometimes, and when it happens it's probably gonna affect a lot of data. Heck, at around 7 or 8 nines you have to account for the possibility that your operations team will decide that all your data is a vicious pack of timberwolves and needs to be defeated.
Note that's durability not reliability. You might not be able to get at it with every request (I think 99.99% is the target) but it'll still be there if you try again later.
But Amazon doesn't offer eleven 9s of availability. I don't think anybody serious does, so arguing about how silly eleven 9s of availability would be is kind of pointless. The SLA is only four 9s of availability.
Note: they say "S3 is DESIGNED for 11 9s of durability". It's PR-speak to say that they don't give you any guarantee, but in theory the system is designed in a magnificent way.
Eleven 9s of durability is about the likelihood of AWS losing your data. It doesn't cover the likelihood of you being able to access your data; that's called availability.
For example, on GCS (Google's S3)... a storage class specifies in how many locations the data is made available. All storage classes share the same durability (chance of Google losing your data) of 99.999999999%, but have different availability (chance of being able to retrieve the data).
It says that their ideal-case failure rate is 11 nines; that's how much you should lose to known, lasting issues like machines failing and cutting over.
Amazon's actual SLA offers 2 nines and 3 nines as the credit thresholds. So they're stating the reliability of their known system, and the rest is for events like this.
Durability and uptime are not the same thing. Durability is about the chance of losing your data and has nothing to do with service disruptions. Their uptime SLA is much lower. Looking at [1], it looks like the SLA says 3 9s (discounts given for anything lower) of uptime.
As I understand it, those guarantees don't mean that the service will actually stay up for the given number of 9s; it's that you'll be reimbursed monetarily if and when they go down.
Kinda the same thing, though. I mean, from my perspective there's no substantive difference between me saying "this service will stay up 99.xx% of the time" and me buying insurance to pay you for the 0.xx% of the time I might fail.
The alternative is that I use the insurance to pay my legal fees when you sue me for not meeting my uptime guarantees.
It's not the same thing. The Amazon service might only be costing you $100/mo, but if it goes down the cost to your business might be millions. They'll reimburse you the $100, not the millions.
Yeah, as soon as I read this I felt bad for the employee. I remember writing an UPDATE statement without a WHERE clause and having to restore the table from backup. But that was at a company not as advanced as Amazon. Fat-fingering a command like that is just crazy (but comforting that even at Amazon it happens), and I'm sure they've made sure it can't happen again.
FWIW: Setting "safe-updates=1" in ~/.my.cnf will require UPDATE and DELETE statements in the client to have a WHERE clause which references a key. It's not perfect protection, but it will save you from a lot of mistakes.
I once ran: DELETE FROM table WHERE [long condition that resolved to true for all records]
Now I write
SELECT or SELECT COUNT(*) over and over again until I see the data I expect, and then change it to a DELETE/UPDATE.
It's not my personal habit, but some folks I know turn off autocommit and BEGIN a transaction every time they enter an interactive SQL session. They then default to ROLLBACK at least once before COMMITting.
That and having a user with read-only permissions or a read replica
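A rough sketch of the "dry-run count plus transaction" habit in programmatic form, using Python's built-in sqlite3; the table and the expected row count are invented for the example:

```python
import sqlite3

# Autocommit mode (isolation_level=None) so we control BEGIN/COMMIT ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, active INTEGER)")
conn.executemany("INSERT INTO users (active) VALUES (?)",
                 [(1,)] * 95 + [(0,)] * 5)

EXPECTED_ROWS = 5  # what a SELECT COUNT(*) dry run said we should touch

cur = conn.cursor()
cur.execute("BEGIN")
try:
    cur.execute("UPDATE users SET active = 1 WHERE active = 0")
    if cur.rowcount != EXPECTED_ROWS:
        # The WHERE clause matched a different number of rows than the dry
        # run predicted: bail out instead of committing a surprise.
        raise RuntimeError(f"expected {EXPECTED_ROWS} rows, got {cur.rowcount}")
    cur.execute("COMMIT")
except Exception:
    cur.execute("ROLLBACK")
    raise
```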
Hmm, that's kinda cool. I'm in a MS shop and I don't know if SSMS has the same feature. My manager just looked at me and said "welp, go restore the table and be more careful next time." I was a new DBA at the time - still kinda new, really.
I once brought down our entire production XenServer cluster group by issuing a "shutdown now" in the wrong SSH window. Needless to say it was a bad feeling watching Nagios go crazy and realizing what had just happened.
root@baz # shutdown now
W: molly-guard: SSH session detected!
Please type in hostname of the machine to shutdown: foo
Good thing I asked; I won't shutdown baz ...
Surprising to see such a simple protection neglected.
Oh crap! Yeah I bet you were pretty panicked. My update statement destroyed data that my team used all the time so I was worried I'd get fired. Luckily that wasn't the case.
When I first started using Linux and wanted to do some housecleaning, I did "rm -r *" in a folder. Cleaned up everything, no prob. Then went to some more folders, hit the up arrow on my keyboard fast to get to a command I had used before. Hit 'enter' before my brain realized I had landed on "rm -r *" and not the right command. Never used that command again.
Automation tends to make those kinds of errors worse rather than better. Perhaps more infrequent and of a different nature than before, but screwing up an automated action cascades much, much faster than a human initiated one. As a result, you have to watch things a good deal closer and build in more and tighter safe guards.
I've heard (and sometimes pushed) this rhetoric before, but something should be well understood before it's automated. Things that happen very rarely should be backed with a playbook + well exercised general monitoring and tools. This puts human discretion in front of the tools' use and makes sure ops is watching for any secondary effects. Ops grimoires can gather disparate one-offs into common and tested tools, but they don't do anything to consolidate the reasons the tools might be needed.
To me that sounds like development and testing (i.e. figuring out what the steps are). Once you have that it should be automated fully.
Too often people will put up with the "well, we only do this once a month so it's not worth automating". Literally, I script everything now, just in simple bash... if I type a command, I stick it into a script, and then run the script. Over time you go back and modify said script to be better, and eventually this turns into a more substantive application. At a certain point, around the time that you have more than one loop or are trying to do things based on different error scenarios, it's probably time to turn to rewriting it in another language.
The simplest thing this does for me is guarantee that all the parameters needed are valid and present before continuing.
I've been doing it this way for years and it really, really works. Some places have reservations with it since its lack of formality is considered "risky" by some.
Though, an alternative to switching to another language is using xargs well. Writing bash with some immutability has been pretty invaluable for my workflows lately.
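A minimal sketch of the "validate every parameter before doing anything" idea from the comment above (the commenter does it in bash; this is Python for consistency with the other sketches here). The hostname pattern, the --reason flag, and the 10-server cap are all invented for illustration:

```python
#!/usr/bin/env python3
"""Sketch: fail fast on bad inputs before touching anything destructive."""
import argparse
import re
import sys

HOST_RE = re.compile(r"^[a-z0-9-]+\.internal$")   # hypothetical naming scheme
MAX_SERVERS = 10                                   # refuse suspiciously large sets


def parse_args(argv):
    parser = argparse.ArgumentParser(description="remove servers from a pool")
    parser.add_argument("servers", nargs="+", help="hostnames to remove")
    parser.add_argument("--reason", required=True, help="ticket or change id")
    args = parser.parse_args(argv)

    bad = [h for h in args.servers if not HOST_RE.match(h)]
    if bad:
        parser.error(f"malformed hostnames: {', '.join(bad)}")
    if len(args.servers) > MAX_SERVERS:
        parser.error(f"refusing to act on {len(args.servers)} servers (max {MAX_SERVERS})")
    return args


if __name__ == "__main__":
    args = parse_args(sys.argv[1:])
    # Only after every input has been checked do we touch anything.
    for host in args.servers:
        print(f"would remove {host} ({args.reason})")
```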
It's probably their name for an automated admin task. The post does not imply that this was merely a checklist of things to do. Ansible calls its automation recipes "playbooks" as well.
It's probably a page on the internal wiki that the S3 team follows for that particular task. Most of the actual steps are probably automated, but it sounds more like a checklist.
I used to follow runbooks/playbooks written on the internal wiki when I worked at Amazon.
I don't think it means "playbook" in the Ansible sense. The dictionary (i.e. Wikipedia) definition of "playbook" is "a document defining one or more business process workflows aimed at ensuring a consistent response to situations commonly encountered during the operation of the business", and that's how I know it.
At $work, certain types of frequently-occurring alerts have playbooks that document how the alert in question can be diagnosed and how known causes can be remedied. Something like "Look at Grafana dashboard X. If metric Y is doing this and that thing, the cause is Z. Log on to box 16 and systemctl restart the foo.service."
To be fair, the real problem isn't that someone screwed up a playbook or command. The real problem is that a tiny mistake in a command can cause an entire service to be disrupted for hours. That's the problem that needs to be fixed.
"While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future."
It seems like sometimes this is just how iteratively automating things works, especially on an internal-facing tool.
You have some process that starts out being "deploy this app with this java code". You deploy once and while, so it's not a big deal. But then those changes get a bit more frequent and so you pull out the common bits and the process becomes "make this YAML change in git and redeploy the app".
That works until you find yourself deploying 5 times a day, so you turn it into a MySQL table, and the process becomes "write a ROLL plan that executes this UPDATE x=y WHERE t=u; command"
After a while you get super annoyed at some quirk of the commands and figure, "Ok, fine, I'll just add an endpoint and some logic that just does this for the command case."
Then you wanna go on vacation and the new guy messed up the API request last week, so you figure, "I'll just add a little JS interface with a little red warning if the request is messed up in this way or that before I go".
You get back from vacation and some original interested party (whoever has wanted all these changes deployed) watched the intern make the change and thinks they could just do it themselves if they had access to the interface. You're wary, but you make the changes together a few times and maybe even add a little "wait-for-approval" node in the state machine.
Life is good. You've basically de-looped yourself, aside from a quick sanity check and button press, instead of what was a ~2 hour code + build + PR + PR approved + deploy process.
Then that interested party goes to work for Uber and the rest of your team adds a few functionalities on top of the interface you built and it all goes pretty well, until you realize that now that this thing that used to be 20 YAML objects is now 50k database records, and a bunch of them don't even apply anymore. So you build a button to disable some group of them, but after getting it deployed you realize it's actually possible to issue a "disable all" request accidentally if you click a button in your janky JS front-end before the 50k records download and get parsed and displayed. Oops! This mistake that you and the original interested party would have never made (because you spent the last 2 years thinking about all this crap) is probably a single impatient anxious mouse-click away from happening. So you make a patch and deploy that.
Congrats! You found that particular failure mode and added some protections for it, and maybe added some other protections like rate-limiting the deletions or updates or whatever. That's cool, but is that every failure mode? I bet it isn't. What happens when someone else thinks you have too many endpoints and just drops to SQL for the update?
Basically, yeah, of course you think of this stuff while iterating on it. But you figure "only power users are on the ACL" or "my teammates will understand the data model before making changes, or ask me first" or "that's what ROLL plans are for" or "I'll show a warning in the UI" or whatever. Fundamentally, you're thinking about a way to do a thing, if you're even thinking about it at all.
So yeah, that's what I've spent the last year or two doing. :-)
I have been doing this as well, though not quite at this scale; it's mostly Python scripts to automate something, but because of the low scale and the fact that I am the sole owner + user, I am good to go :-D
You don't have to know what would be a mistake. E.g. if the tool is used most of the time to operate on a small set of servers, you have some extra confirmation or command-line option for removing a large set.
That's good UI design in tools with powerful destructive capabilities. You make the UI for doing lots of things different enough from the UI for the few things you do routinely that there's no mistaking them.
Yes, but be careful. UIs like that tend to accumulate "--yes" options, because you don't feel like being asked every time for 1 server. Then one day you screw up the wildcard and it's 1000 servers, but you used the --yes template.
Which is why I'm pointing out that to design UIs like these you should fall back on slightly different UIs depending on the severity of the operation.
This is a good pattern to use. The more pre-feedback I get, the less likely I am to make a horrible mistake.
However one problem I often see with this pattern is the numbers are not formatted for humans to read. Suppose it prompts:
"1382345166 agents will be affected. Proceed? (y/n)"
Was that ~100M or ~1B agents? I can't tell unless I count the number of digits, which itself is slow and error-prone. It's worse if I'm in the middle of some high-pressure operation, because this verification detour will break my concentration and maybe I'll forget some important detail.
Now if the number is formatted for a human to consume, I don't have to break flow and am much less likely to make an "order-of-magnitude error":
"1,382,345,166 (1.4M) agents will be affected. Proceed? (y/n)"
I always attempt to build tooling & automation and use it during a project, rather than running lots of one-off commands. I find this usually saves me & my team a lot of time over the course of a project, and helps reduce the number of magical incantations I need to keep stored in my limited mental rolodex. I seem to have better outcomes than when I build automation as an afterthought.
I think it depends on the quality of the feedback. Most tooling sucks, so the messages are very literal trace statements peppered through the code, rather than statements of what the user-facing impact will be. When the thing is just spitting raw information at me, I'm probably going to train myself to ignore it. But if it can tell me what is going to happen, in terms that I care about, then I'll pay attention.
Imagine I just entered a command to remove too many servers that will cause an outage:
"Finished removing servers"
(better than no message, I suppose)
vs
"Finished removing 8 servers"
(better, it's still too late to prevent my mistake
but at least I can figure out the scale of my mistake)
vs
"8 servers will be removed. Press `y` to continue"
(better, no indication of impact but if I'm paying
attention I might catch the mistake)
vs
"40% capacity (8 servers) will be removed.
Load will increase by 66% on the remaining 12 servers.
This is above the safety threshold of a 20% increase.
You can override by entering `live dangerously`."
(preemptive safety check--imagine the text is also red so it stands out)
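A minimal Python sketch of that last, preemptive style of check; the 20% threshold, the fleet size, and the "live dangerously" override are modeled on the example above and are otherwise invented:

```python
SAFETY_THRESHOLD = 0.20  # hypothetical maximum acceptable load increase


def check_removal(total: int, to_remove: int) -> None:
    """Show user-facing impact, then refuse large removals without an override."""
    remaining = total - to_remove
    if remaining <= 0:
        raise SystemExit("refusing: this would remove the entire fleet")
    load_increase = total / remaining - 1          # e.g. 20/12 - 1 = ~0.67
    removed_pct = to_remove / total
    print(f"{removed_pct:.0%} capacity ({to_remove} servers) will be removed.")
    print(f"Load will increase by {load_increase:.0%} on the remaining {remaining} servers.")
    if load_increase > SAFETY_THRESHOLD:
        print(f"This is above the safety threshold of a {SAFETY_THRESHOLD:.0%} increase.")
        if input("Type `live dangerously` to override: ") != "live dangerously":
            raise SystemExit("aborted")


if __name__ == "__main__":
    check_removal(total=20, to_remove=8)
```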
Obviously some UIs make some errors less likely. You don't have the "launch the nukes" button right next to the "make coffee" button, because humans are clumsy and don't pay attention.
Fat-finger implies you made your mistake once. A UI can't stop you from setting out to do the wrong thing, but it can make it astronomically unlikely to do a different action than the one you intended.
Simple example: I have a git hook which complains at me if I push to master. If I decide "screw you, I want to push to master", it can't assess my decision, but it easily fixes "oops, I thought I was on my branch".
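For illustration, a rough Python version of that kind of guard as a git pre-push hook: git runs whatever executable sits at .git/hooks/pre-push and feeds it one line per ref being pushed on stdin; exiting non-zero aborts the push. Treating "master" as the protected branch is the assumption here:

```python
#!/usr/bin/env python3
"""pre-push hook: refuse accidental pushes to master.

Install as .git/hooks/pre-push (executable). Git passes the remote name and
URL as arguments and lines of the form
    <local ref> <local sha> <remote ref> <remote sha>
on stdin. Delete or bypass the hook when you really do mean it.
"""
import sys

PROTECTED = {"refs/heads/master"}

for line in sys.stdin:
    parts = line.split()
    if len(parts) == 4 and parts[2] in PROTECTED:
        sys.stderr.write(f"pre-push: refusing to push to {parts[2]} "
                         "(remove/skip the hook if you really mean it)\n")
        sys.exit(1)
```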
There's a balance to be struck. I'd say the number of hoops you have to jump through to do something should scale with the potential impact of the operation.
That said, the only way to completely prevent mistakes is to make the tool unable to do anything at all.
(Or to encode every possible meaning of the word "mistake" in your software. If you could do that, you would probably get a Nobel prize for it.)
In a program I wrote I make the user manually type "I AGREE" (case-sensitive) in a prompt before continuing, just to avoid situations where people just tap "y" a bunch of times.
Habituation is a powerful thing: a safety-critical program used in the 90s had a similar, hard-coded safety prompt (<10 uppercase ASCII characters). Within a few weeks, all elevated permission users had the combination committed to muscle memory and would bang it out without hesitation, just by reflex: "Warning: please confirm these potentially unsaf-" "IAGREE!"
It's indeed a real problem. Hell, I myself am habituated to logins and passwords for frequently used dialog boxes, and so just two days ago I tried to log in on my work's JIRA account using test credentials for an app we're developing...
For securing very dangerous commands, I'd recommend asking the user to retype a phrase composed of random words, or maybe a random 8-character hexadecimal number - something that's different every time, so can't be memorized.
I think that even if someone can't memorize the exact characters, they'll memorize the task of having to type over the characters. Better would be to never ask for confirmation except in the worst of worst cases.
That's what I meant in my original comment when I wrote that "number of hoops you have to jump through to do something should scale with the potential impact of an operation". Harmless operations - no confirmation. Something that could mess up your work - y-or-n-p confirmation. Something that could fuck up the whole infrastructure - you'd better get ready to retype a mix of "I DO UNDERSTAND WHAT I'M JUST ABOUT TO DO" and some random hashes.
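A small sketch of the "retype something different every time" idea in Python; the word list and phrase length are arbitrary:

```python
import secrets

WORDS = ["amber", "falcon", "quartz", "delta", "harbor", "lintel", "mosaic", "tundra"]


def confirm_dangerous(action: str) -> bool:
    """Make the operator retype a phrase that is different every time.

    Unlike a fixed "I AGREE", this can't be banged out from muscle memory.
    """
    phrase = " ".join(secrets.choice(WORDS) for _ in range(3))
    print(f"You are about to: {action}")
    print(f"Type exactly '{phrase}' to continue.")
    return input("> ").strip() == phrase


if __name__ == "__main__":
    print("confirmed" if confirm_dangerous("restart the index subsystem") else "aborted")
```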
I've almost deleted my heroku production server even though you need to type (or copy paste....ahem...) the full server name (e.g. thawing-temple-23345).
I think the reason was that, because in my mind I was 100% sure this was the right server, when the confirmation came up I didn't stop to check whether it was indeed the correct one. I mechanically started to type the name of the server, and just a second before I clicked OK, I had this genius idea to double-check.... Oh boy... My heart dropped to the floor when I realized what I was about to do.
You could say that indeed Heroku's system of avoiding errors worked correctly....
However, the confirmation dialog wasn't what made me stop... Instead it was my past self's experience screaming at me, reminding me of that ONE time I did fuck up a production server years ago (it cost the company a full day of customers' bids... Imagine the shame of calling all the winning bidders and asking them what price they ended up bidding to win....)
My point is, maybe no number of confirmation dialogs, however complex they are, will stop mistakes if the operator is fixated on doing X. If you are working in semi-autopilot mode because you obviously are very smart and careful (ahem..), you will just do whatever the dialog asks you to do without actually thinking about what you are doing.
What, then, will make you stop and verify? My only guess is that experience is the only way. I.e. only when you seriously fuck up do you learn that, no matter how many safety systems or complex confirmation dialogs there are, you still need to double and triple check each character you typed, unless you want to go through that bad experience again....
A well-designed confirmation doesn't give you the same prompt for deleting some random test server as it does for deleting a production server. That helps with the "autopilot mode" issue.
I agree that it should help reduce the number of mistakes.
But I still believe auto-pilot mode is a real thing (and a danger!) .
My point is that I'm not sure if it's even possible to design one that actually cuts errors to 0.
And if that's indeed the case, even if it's close to 0, it's still non-zero, thus at the scale Amazon operates at, it's very probable that it will happen at least one time.
Maybe sometime in the future AI systems will help here?
I totally agree that it's a real issue, a danger, and that it's impossible to cut errors to zero.
I've also built complex systems that have been run in production for years with relatively few typo-related problems. The way I do it is with the design patterns like the one I just mentioned, which is also what TeMPOraL was talking about (and I guess you missed it.)
If you have the same kind of confirmation whenever you delete a thing, whether it's an important thing or not, you're designing a system which encourages bad auto-pilot habits.
You'll also note that Amazon's description of the way that they plan on changing their system is intended to fire extra confirmation only when it looks like the operator is about to make a massive mistake. That follows the design pattern I'm suggesting.
You could go further and try to prevent cat-on-the-keyboard mistakes, which is maybe what you're describing (solve this math equation to prove you are a human who is sufficiently not inebriated). Or even further and prevent malicious, trench-coat wearing, pointy-nosed trouble-makers.
The point is, yes, it is possible. That's what good design does.
It's not possible to be perfect, but you can certainly do better than taking down S3 because of a single command gone wrong.
One thing I have been doing for my own command line tools is adding a preview of what a command will do and making that preview the default. It's simple, but if the S3 engineer had first seen a readout of the huge list of servers that were going to be taken offline, instead of the small expected list, we probably would not be talking about this. There's obviously a ton more you can do here (have the tool throw up "are you sure" messages for unusual inputs, etc).
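A minimal sketch of a dry-run-by-default CLI in Python; the inventory function, host names, and --execute flag are invented for the example:

```python
import argparse


def resolve_targets(pattern: str):
    """Hypothetical inventory lookup; a real tool would query the fleet."""
    fleet = [f"s3-index-{i:03d}" for i in range(40)]
    return [h for h in fleet if pattern in h]


parser = argparse.ArgumentParser(description="take servers offline")
parser.add_argument("pattern", help="substring matching the hosts to remove")
parser.add_argument("--execute", action="store_true",
                    help="actually do it; without this flag we only preview")
args = parser.parse_args()

targets = resolve_targets(args.pattern)
print(f"{len(targets)} server(s) match {args.pattern!r}:")
for host in targets:
    print(f"  {host}")

if not args.execute:
    print("Dry run only. Re-run with --execute to take these offline.")
else:
    print("Removing servers...")  # the real action would go here
```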
If the computer knows exactly what actions would be a mistake - how? The difference between correct and incorrect (not to mention legal and illegal) is usually inferred from a much wider context than what is accessible to a script. Mind you, in this specific case, Amazon even implies that such a command could have been correct under other circumstances.
So, this means a) strong superhuman AI (good luck), b) deciding from an ambiguous input to one of possibly mistaken actions (good luck mapping all possible correct states), or c) drool-proof interface ("It looks you're trying to shut down S3, would you like some help with that?").
TL;DR: yes, but it's a cure worse than the disease.
Possibly the values were all within range. It was just that this operation only worked on elements that were a subset. No amount of validation will catch that error.
You could feedback a clarification, but if that happens too often nobody will double check it after they have seen it over and over.
While you can't prevent user error without preventing user capability, you can (as others have observed) follow some common heuristics to avoid common failure modes.
A confirm step in something as sensitive as this operation is important. It won't stop all user error, but it gives a user about to accidentally turn off the lights on US-EAST-1 an opportunity to realize that's what their command will do.
If you have a UI that allows you to undeploy 10 servers, it will also allow you to undeploy 100 servers - unless you specifically thought about the possibility that there might be a lower bound on the number of servers, which they obviously hadn't before this. It's easy to talk about it after the fact, but nobody is able to predict all such scenarios in advance - there are just too many ways to mess up to have special code for all of them in advance.
The tool as a whole should incorporate a model of S3. Any action you take through the UI should first be applied to this model, and then the resulting impact analyzed. If the impact is "service goes down", then don't apply the action without raising red flags.
Where I work we use PCS for high availability, and it bugs the heck out of me that a fat-fingered command can bring down a service. PCS knows what the effect of any given command will be, but there's no way (that I know of) to do a "dry run" to see whether your services would remain up afterward.
In practice, it would likely be very hard to make a model of your infrastructure to test against, but I can imagine a tool that would run each query against a set of heuristics, and if any flags pop up, it would make you jump through some hoops to confirm. Such a tool should NEVER have an option to silently confirm, and the only way to adjust a heuristic if it becomes invalid should be formally getting someone from an appropriate department to change it and sign off on it.
By the way, this is how companies acquire red tape. It's like scar tissue.
They probably didn't know the service would go down. For that, you need to identify the minimal requirements for the service to be up, upfront, and code those requirements into the UI, upfront. Most tools don't do that. File managers don't check that the file you delete isn't necessary for any of the installed software packages to run. Shells don't check that the file you're overwriting isn't a vital config file. Firewall UIs don't check that the port you're closing isn't vital for some infrastructural service. It would be nice to have a benevolent, omniscient, God-like UI that would have the foresight to check such things - but usually the way it works is that you learn about these things after the first (if you're lucky) time it breaks.
Or it makes you re-enter the quantity of affected targets as a confirmation, similar to the way GitHub requires a second entry of a repo name for deletion.
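A tiny Python sketch of that GitHub-style "retype the count" confirmation; the function and prompt wording are hypothetical:

```python
def confirm_count(targets: list) -> bool:
    """Require the operator to retype the number of affected targets.

    You can click "yes" on autopilot, but typing "8" when you believed you
    were removing 2 servers forces a moment of recognition.
    """
    print(f"This will affect {len(targets)} target(s):")
    for t in targets:
        print(f"  {t}")
    typed = input("Re-enter the number of targets to confirm: ").strip()
    return typed == str(len(targets))


if __name__ == "__main__":
    ok = confirm_count(["s3-index-001", "s3-index-002"])
    print("confirmed" if ok else "aborted")
```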
Agreed -- a postmortem should cite that deployment goof as the immediate cause, with a contributory cause of "you can goof like this without getting a warning etc".
I don't understand how this is even possible in a company operating on that scale. Granted, I'm a lowly scientific programmer with no clue about running a cloud infrastructure, but I would have imagined that there would be at least a pretense of oversight for destructive commands run in such an environment. A scheme as simple as "any destructive command run on S3 subsystems is automatically run in a dry run form, and requires independent confirmation by 2-3 other engineers to actually come into effect" would have prevented this altogether. Given the overall prominence of S3, this incident seems to demonstrate a rather callous attitude on the part of the organization.
I thought the same thing before I went into the industry but now that I've been in it for a few years (including two at Amazon), it doesn't surprise me.
I suspect locking everyone down in the way you suggest would cost more in lost productivity (and costs for the infrastructure that would be required for greater auditing, etc.) than is lost in outages like this.
A number of lawyers must have drafted those lines, and five people, including Bezos, must have approved them.
Those lines are not reflective of what Amazon is, but of what picture Amazon wants to paint now. They have clarified that it was their error and not some hacking attempt. Secondly, they have not vilified the engineer in question, because Amazon's culture is already a bit of a ??? in the public mind.
But they have got it right. Shit happens and this is not the first time it has happened or the last time it will happen. Also it will happen with Microsoft, Google and everyone else.
Maybe we will build even better technologies that rely on two different cloud providers instead of one.
It is always going to be like that. If you write software that has a rule "do not remove more than 5% of capacity at once," it will always work; yet if you tell a systems engineer "please do not remove more than 5% of the capacity at once," it fails some 0.0x% of the time. The solution is to move the execution of the change into a system that spits out steps that are automatically executed by the system itself, entirely removing the human factor.
Every communication channel has its flaws. CLI is fast and that's why it is a favorite. It is also noisy. If you have to worry about a fat finger, you are using the wrong communication channel or could afford to be a bit more verbose within that channel. That's why rm has safety nets.
GUIs are really great. They're a recent development in the computing industry that help mitigate this sort of problem. You can even put prompts in that get you to confirm Yes/No to continue.
I think Borland do some RAD systems, and Microsoft have an IDE of sorts on the way too.
20 years ago I read a postmortem of Tandem and their Non-Stop Unix. A core take-away for me was: "Computer hardware has gotten way more reliable than it was." combined with "The leading cause of outages has become operators making mistakes."
Serious question - why did no one ever accidentally launch and nuke a city, with thousands of nuclear warheads able to do so on short notice? AWS presumably puts a lot more redundancy in, and yet with all that effort comes up this far short. Why? It has a huge amount of brainpower set up so that this never, ever happens. Whatever works for the military - can't AWS adopt those actual best practices?
When I think about questions like these, I recall the Anthropic Principle. Perhaps on lots of planets, intelligent life ceased at the beginning of the Atomic Age. Here we are seven decades (several generations!) in, and we're still alive! The numerator on the odds almost doesn't matter, when you never get to see the denominator. Now that we're finding all these planets, perhaps we ought to start looking for nuclear extinction events? They probably wouldn't leave lasting evidence, but if they're common enough they wouldn't need to...
Actually the accounts I've read seem to indicate that most missile operators simply decided they would never launch no matter what. God bless them, for that.
So they're going to build a complex system to correct possible user command line errors. That new system itself will introduce possible errors. Wouldn't an administrative GUI have been much simpler to implement overall?
One of the positive things about Amazon's culture is that they heavily emphasize blaming broken processes, not blaming people. I doubt the person involved will have any negative consequences beyond embarrassment.
I would be horrified if I learned that Amazon or any other company of such size in any way castigates employees for such very human errors. The guilt (don't beat yourself up) he or she likely feels is bad enough.
Anyway, to me this firstly sounds like a "tool" or command that was too powerful with not enough safeguards. Who knows, the command might even be ambiguous.
" At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended."
> While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.
These sorts of things make me understand why the Netflix "Chaos Gorilla" style of operating is so important. As they say in this post:
> We build our systems with the assumption that things will occasionally fail
Failure at every level has to be simulated pretty often to understand how to handle it, and it is a really difficult problem to solve well.
> Failure at every level has to be simulated pretty often to understand how to handle it, and it is a really difficult problem to solve well.
Exactly. It seems likely that Amazon tests the restart operation, but it would be hard to test it at full us-east-1 scale. Running a full S3 test cluster at that scale would likely be a prohibitive expense. Perhaps the "index subsystem" and "placement subsystem" are small enough for full-scale tests to be tractable, but certainly not cheap, and how often do you run it? Also, hindsight is 20/20, but before this incident it might have been hard to identify "full-scale restart of the index subsystem" as rising to the top of the list of things to test.
One approach is to try to extrapolate from smaller-scale tests. It would be interesting to know what kinds of disaster testing Amazon does do, and at what scale, and whether a careful reading could have predicted this outcome.
> Failure at every level has to be simulated pretty often to understand how to handle it
Keep in mind, S3 "fails" all the time. We regularly make millions of S3 requests at my work. Usually we get 1:240K failure rate (mostly GETs), returning 500 errors. However, if you're really hammering an S3 node in the hash ring (e.g. Spark job), we see failures in the 1/10K range, including SocketExceptions, where the routed IP is dead.
You need to always expect such services to die in your code, setting the proper timeouts, backoffs, retries, queues, and dead letter queues.
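A generic Python sketch of the retry-with-backoff part of that advice; the wrapper is an illustration, not any particular SDK's API (the AWS SDKs ship their own retry configuration):

```python
import random
import time


def with_retries(call, max_attempts=5, base_delay=0.2, max_delay=10.0,
                 retryable=(Exception,)):
    """Call `call()` with bounded retries, exponential backoff, and jitter.

    Jitter spreads retries out so thousands of clients don't hammer the
    service in lockstep after a transient failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retryable:
            if attempt == max_attempts:
                raise                      # let the caller dead-letter it
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))


# Usage sketch (the S3 call is hypothetical):
# obj = with_retries(lambda: s3.get_object(Bucket="my-bucket", Key="my-key"))
```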
> Perhaps the "index subsystem" and "placement subsystem" are small enough for full-scale tests to be tractable, but certainly not cheap, and how often do you run it?
Rough guide:
CT = cost of 1 full scale test with necessary infrastructure and labor costs added up
CF = amount of money paid out in SLA claims + subjective estimate of business lost due to reputation damage etc
PF = estimate of probability of this event happening in a given year
if PF * CF > CT, then you run such a test at least once a year. Think of such an expense as an insurance premium.
What Netflix does with their simian army is amortize the cost of doing the test across millions of tests per year and the extra design complications arising from having to deal with failures that often.
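A worked example of the rough guide above, with made-up numbers plugged in:

```python
# Made-up numbers plugged into the rough guide above.
CT = 250_000      # cost of one full-scale test (infrastructure + labor)
CF = 4_000_000    # SLA payouts + estimated business lost if the event happens
PF = 0.10         # estimated probability of the event in a given year

expected_annual_loss = PF * CF          # 400,000
print(f"expected annual loss: ${expected_annual_loss:,.0f} vs test cost ${CT:,.0f}")
if expected_annual_loss > CT:
    print("run the test at least once a year; treat CT as an insurance premium")
```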
This is precisely why cells (alluded to in the write-up) are beneficial. If the size of a cell is bounded and you scale by adding more cells, testing the breaking point of the largest cell becomes an easier problem. There is still usually a layer that spans across all cell boundaries, which is what then becomes hard to test at prod scale (so you make that as simple as possible)
Running a full zone test is only possible when they have a new zone available, unused. I bet they do these tests, and they now have a new scenario to test.
They also probably have one or more test regions where they could perform a test like this. But it's presumably not at nearly the same scale as us-east-1, the region affected by this incident. And to a considerable extent the problem was one of scale. The writeup makes the recovery sound fairly straightforward; but due to the sheer size of S3 in this region, it took hours for the system to come back up, which was apparently unexpected.
(Nit: this incident affected a region, not a zone. us-east-1 is a region, which is divided into zones us-east-1a, us-east-1b, etc. S3 operates on regions.)
I recently saw a talk where they referred to Chaos Monkey (kills instances), Chaos Gorilla (kills many instances for a single service in a single region) and Chaos Kong (takes an entire region offline)
> Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention.
Yep. There's a transition period where you can't rely on redundancy any longer because there are so many components that it's basically inevitable that at any given time somewhere something will be in a degraded state. So you design for that case, the degraded normalcy case. You make something failing somewhere a non-emergency. It takes a lot of work to do but when you have things working in that way then you can guarantee that you're in that state by testing it routinely in production.
Totally agree. Would also point out that if you have systems up for many years, they likely haven't been updated in all that time... shouldn't people find that alarming?
> From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3.
Ensuring that your status dashboard doesn't depend on the thing it's monitoring is probably the first thing you should think about when designing your status system. This doesn't fill me with confidence about how the rest of the system is designed, frankly...
I agree with that in general, but having your monitoring system be dependent on the thing it monitors is a pretty big goof. It's possible that the dependency was very non-obvious and many layers deep, which is more understandable, but still... it's pretty fundamental.
They have a twitter account for such incidents and used it appropriately. They did not slack in relaying the outage to customers, and between that and the fact that no S3 services were operating I think the message was pretty clear: "We fucked up, give us a couple hours"
And apparently they had never tried rebooting some of the most important parts of that system. Just when you start to think that someone's really gotten it right you come to learn they're just fumbling around in the dark like everyone else.
My interpretation of this is that the indexing system was resilient to the loss of a certain amount of capacity (probably around ⅓ + 1 host). As a guess, the indexing system probably used some form of consensus (e.g. Paxos) which has had an active leader for years. Deployments stay within that capacity constraint, so while hosts have been restarted and replaced (data center migrations, hardware lease expiration, failures, upgrades, etc.), they may not have recently run into a situation where quorum wasn't available for a partition, especially at the scale of restarting the entire fleet.
Since restarting the entire fleet would incur downtime of all relevant S3 operations, it's unlikely that it was something ever intentionally done in production (and they may or may not have run that scenario in other environments).
Source: I used to run several large scale services at Amazon.
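If that guess about quorum is right, the arithmetic of majority quorum is what makes a fleet-wide removal so unforgiving. A toy sketch, assuming a Paxos/Raft-style partition; the fleet sizes are invented:

```python
# Toy illustration of majority quorum for a consensus-backed index partition.
def has_quorum(total_members: int, alive_members: int) -> bool:
    # A partition can keep serving only while a strict majority is up.
    return alive_members >= total_members // 2 + 1

print(has_quorum(9, 6))  # True: losing 3 of 9 is survivable
print(has_quorum(9, 4))  # False: losing 5 of 9 forces the partition to stop and restart
```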
To put what @jonhohle said another way, Amazon had probably never brought up the entirety of S3 from zero to production-ready in a production environment before. I wouldn't necessarily classify this as "fumbling around in the dark." Perhaps they should have tested this in a simulated environment, but (to be fair) on a distributed fault-tolerant system, it probably wasn't a top-priority situation to test.
I'm curious as to why their fix was to host the Service Health Dashboard on more AWS regions. It seems like the responsible thing to do is to host it entirely on a competitor's service. That way, it's very simple to know that the status page will work no matter what happens to you.
If they did host their status page on a competitor's service, then they'd be reliant on that service, which might backfire if the competitor's service goes down while Amazon's own systems stay up.
What they really need is failover capability, which can fire up the status page on a competitor's service (or maybe on a completely separate disaster recovery site owned by Amazon) in case Amazon's own services go down.
I'm sure Amazon's architects and engineers are more than capable of designing and implementing such a robust system and recognizing its importance. So it puzzles me as to why it wasn't done.
You can use several hosts and have two subdomains, so if one is not responding, engineers and managers know there are two status pages. Heck, have two different domains for them as well in case there are DNS issues: amzstatus1.com and 2. Not dependent on the Amazon domain anymore either.
Or to host it on a static .html page that gets rewritten every 60 seconds or so by an external process, running on physical servers. Minimal stack, so minimal attack surface.
Or have it pull from two sources, one local (S3) and one remote (GCE or whatever), and make a hard positive from either source signal "down." Otherwise the page would be down if just the remote source were down.
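A sketch of that "hard positive from either source" logic; the probe endpoints are hypothetical and the point is only the OR, not the plumbing:

```python
# Hypothetical status aggregator: report "down" if EITHER probe positively says so,
# but don't report "down" merely because a probe itself is unreachable.
import urllib.request

def probe(url: str):
    """Return True (down), False (up), or None (probe unreachable / unknown)."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read().decode().strip() == "down"
    except OSError:
        return None

def service_is_down(local_probe_url: str, remote_probe_url: str) -> bool:
    results = [probe(local_probe_url), probe(remote_probe_url)]
    return any(r is True for r in results)  # a hard positive from either source wins
```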
Considering the size of AWS, the number of services any one service relies on is huge; even if only a single dependency many layers deep uses S3, when S3 goes down the service as a whole will be affected. Honestly, S3 is like a black hole: everybody stores everything in S3 these days. It's a horizontal component, but used in a vertical manner. Weird but true.
I'd argue that the message of this post-mortem, that is "mistakes were made, but the fault is with the tools and not any one person" is a much better response than the CEO making a symbolic statement claiming the fault.
Both better for morale and better for preventing another incident.
To me that is just another example of 'caring theater'. Whereby carefully crafted PR responses [1] appear to take responsibility in a 'buck stops here' kind of way. The truth is it is unreasonable in many cases for the top person to be able to prevent any and all errors. If you try and make everything perfect with no mistakes you would never make any money (and of course it's not even possible).
[1] ie 'our customers safety and security is of the utmost importance to us'.
What are the real consequences for a CEO saying that? It's not like he's going to get fired or have his stock options revoked. If anything, people are going to praise him for taking ownership like that, as you did. Virtually no matter what he does, I'd bet someone in his position is going to be very comfortable for the rest of his life.
I'd be far more impressed if a low-level employee whose whole family depended on his job and who stood a good chance of getting fired admitted a serious mistake.
We all watched the news and I recall him saying that. The specific quote I don't remember, but it was something like "you can consider that I did." I think he was asked what would happen to the person who caused it and who that person was.
Everyone knew right away this had to be human error. Right away. Switches simply had too much redundancy.
It was big then and not sure if I can locate a video.
Not as interesting an explanation as I was hoping for. Someone accidentally typed "delete 100 nodes" instead of "delete 10 nodes" or something.
It sounds like the weakness in the process is that the tool they were using permitted destructive operations like that. The passage that stuck out to me: "in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level."
At the organizational level, I guess it wasn't rated as all that likely that someone would try to remove capacity that would take a subsystem below its minimum. Building in a safeguard now makes sense as this new data point probably indicates that the likelihood of accidental deletion is higher than they had estimated.
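A minimal sketch of the kind of safeguard being described, assuming a hypothetical removal tool; the minimum-capacity and batch numbers are invented:

```python
# Hypothetical guard in a capacity-removal tool: refuse any request that would
# drop a subsystem below its minimum required capacity, regardless of intent.
class CapacityError(Exception):
    pass

def plan_removal(current_hosts: int, hosts_to_remove: int, minimum_required: int,
                 max_batch: int = 5) -> int:
    if current_hosts - hosts_to_remove < minimum_required:
        raise CapacityError(
            f"Removing {hosts_to_remove} of {current_hosts} hosts would breach "
            f"the minimum of {minimum_required}; refusing."
        )
    # Remove capacity slowly: never more than max_batch hosts per step.
    return min(hosts_to_remove, max_batch)
```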
I'm drawing conclusions based on my time at AWS, but I believe this is due to the service discovery mechanism most of AWS uses. It's a gossip protocol, with a daemon running on each service host. There are very valid reasons you would be manipulating the set of hosts currently being gossiped- to remove a host for maintenance for example. In this use case I think they wanted to take out an entire service, so ensuring a majority is still alive isn't necessarily a solution.
There are two distinct failures here, the tool being too liberal and then the failure mode not being well understood for this index subsystem.
If I remember correctly, all of our processes were organized through a weekly change management process that went through reviews of the exact commands to be run. Being oncall was a bit more liberal, you would typically execute commands on production as needed based on your experience and with others over your shoulder if you had any doubt. Interacting with the gossip protocol was a pretty common thing to do when you were triaging issues.
Unrelated, I was briefly a SME on the EBS billing system and probably interacted with the poor guy who executed this command.
I always wonder about unintended consequences of this sort of thing. Like someday there will be a worm about to rampage through their servers and someone says, "take them all offline now!" and the answer is, "we can't because of the throttle safeguard we put in place after incident XYZ, it will be about 17 hours..."
By safeguard I meant (and I think Amazon means too) an extra step required of the user before they can perform the action, so they don't do it by accident. Not something that prevents it entirely. Like how an MMO, before you delete a character, requires you to type the character's name into a box that pops up. That's far outside the realm of the usual user interface, but it means that if you are just trying to edit a character it's impossible to accidentally hit that delete key. An analogous system for Amazon that would have prevented this outage: delete 10 nodes, OK. Delete 100 nodes, a box pops up saying 'To delete this many nodes you must type the following into a message box: "I want to take down a dangerously large number of nodes."'
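In code, that MMO-style gate might look something like this sketch; the threshold and wording are made up:

```python
# Hypothetical confirmation gate: small removals proceed, large ones demand a
# deliberately awkward typed phrase so they can't happen by reflex.
def confirm_removal(node_count: int, threshold: int = 50) -> bool:
    if node_count <= threshold:
        return True
    phrase = "I want to take down a dangerously large number of nodes"
    typed = input(f'Removing {node_count} nodes. Type exactly:\n  "{phrase}"\n> ')
    return typed.strip() == phrase
```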
I think the biggest problem with flags like --emergency is if they end up in daily use, such as git push --force. Then they are both sudo-level AND used without a lot of thought.
I'm remembering tools I've worked with where the cheeky dev required things like type the sentence "I know what I am doing and wish to proceed." in order to perform unsafe operations.
I've always wondered why ops hasn't adopted some of the best practices that have been around for years to avoid fat-finger errors. Like why don't we have systems where doing something dangerous requires two separate people to run the command, or there's an approval step, or whatever.
The most common explanation: "cute" interactions make it harder to script the command-line tools because you have to account for the extra layer of indirection or write a bit of screen-scrape logic to get the command prompt input.
I've always found that explanation a little threadbare.
I think the hoops you have to jump through should scale with severity. Then if you find yourself writing scripts that screen-scrape interactive output to get around some safeguards, you're doing something very wrong.
Which is why there needs to be an override switch, but it needs to be very very explicit that you are going past the safeguards. And only a limited number of people who can use that override.
There are people who can still do fleet wide root access commands and I don't think that type of thing will ever be removed (just very restricted) for this exact type of situation.
Instead of rate limiting you'd be better off making it a two-key operation once you hit X threshold. Jr admin can delete 10, but the boss needs to confirm a deletion of 100.
You can always build the safeguard to require approval from a peer (or superior) to an action that is normally considered dangerous, at least for overriding the throttle.
Anything that requires human approval for routine operations quickly devolves into bureaucracy that adds a lot of manual steps without any real safety. Now, you may argue definition of "routine operation", but the thing in the article didn't sound like they were doing anything crazy.
But just imagine the "ooooh shit" moment of this person.
Something similar happened to us, when an Engineer deleted part of our production database with a single command. Fortunately, we could reconstruct it from backups and replication logs.
To me, the more important part would've been the solutions they'll come up with to avoid something like this happening again. Is it going to just be "add a line to the playbook asking the engineer to double check the command" or will they make big changes across the system to prevent things like this happening.
I'm interested in something like that, too! I've got a few destructive Ansible commands that I check a few dozen times before running. But there's always a chance that I'm tired/distracted/whatever and run something silly anyway. Typically I put in things like config test checks and prompts, but damn it's scary how much power I have with this Ansible setup. I definitely don't want to be in this AWS position.
Take a moment to look at the construction of this report.
There is no easily readable timeline. It is not discoverable from anywhere outside of social media or directly searching for it. As far as I know, customers were not emailed about this - I certainly wasn't.
You're an important business, AWS. Burying outage retrospectives and live service health data is what I expect from a much smaller shop, not the leader in cloud computing. We should all demand better.
Also notably missing is the "we will automatically refund all affected customers" line that we'd expect from somebody who wants to provide excellent service.
A graphical illustration of the service dependencies they were talking about would have been nice as well.
If you request it and provide evidence that they find compelling.
> To receive a Service Credit, you must submit a claim by opening a case in the AWS Support Center. To be eligible, the credit request must be received by us by the end of the second billing cycle after which the incident occurred and must include:
> the words “SLA Credit Request” in the subject line;
> the dates and times of each incident of non-zero Error Rates that you are claiming; and
> your request logs that document the errors and corroborate your claimed outage (any confidential or sensitive information in these logs should be removed or replaced with asterisks).
> If the Monthly Uptime Percentage applicable to the month of such request is confirmed by us and is less than the applicable Service Commitment, then we will issue the Service Credit to you within one billing cycle following the month in which your request is confirmed by us. Your failure to provide the request and other information as required above will disqualify you from receiving a Service Credit.
Interesting observation. Maybe the answer is that a behemoth like AWS does this because they _can_ get away with it. In contrast to AWS's cascading failures, the GitLab outage was a mere blip. Because they are several orders of magnitude smaller than Amazon, however, they had to be painfully transparent during their actual restore operations and in the post-mortem.
AWS has more implicit trust that this won't happen again, since they've never (I think?) had something like this happen, so just a few lines about fixing the tool that let all the nodes shut down is enough to restore confidence.
Emails seem to be going out. I got one a while ago. I suspect this was an initial response geared towards the general audience and a more specific technical response will be forthcoming.
> At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
I find that making errors on production when you think you're on staging is a big source of similar incidents. One of the best things I ever did on one job was to change the deployment script so that when you deployed you would get a prompt saying "Are you sure you want to deploy to production? Type 'production' to confirm". This helped stop several "oh my god, no!" situations when you repeated previous commands without thinking. For cases where you need to use SSH as well (best avoided but not always practical), it helps to use different colours, login banners and prompts for the terminals.
We have a deploy script that does exactly this, unfortunately we've all gotten our muscle memory so trained that most of us type the deploy command, press enter, type yes and press enter before we're ever even prompted. Fortunately in most cases a quick ctrl-c can prevent any actual damage.
I think the only way to nuke the muscle memory from this equation would be to have it make one type a random dictionary word (or solve an arithmetic problem or something - that one might help prevent drunk deploys ;).
If the prompt was type "production" to confirm, I'm sure I'd just as readily train myself to jump the gun on that one.
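One way to defeat that muscle memory, as suggested above, is to make the confirmation unpredictable. A sketch, assuming a local word list is available; the fallback words are arbitrary:

```python
# Sketch of a deploy confirmation that can't be typed from muscle memory:
# the required word changes every time. Assumes /usr/share/dict/words exists;
# falls back to a tiny built-in list otherwise.
import random

def confirm_production_deploy() -> bool:
    try:
        with open("/usr/share/dict/words") as f:
            words = [w.strip() for w in f if w.strip().isalpha()]
    except OSError:
        words = ["orange", "tripod", "velvet", "quarry", "lantern"]
    word = random.choice(words)
    typed = input(f"Deploying to PRODUCTION. Type '{word}' to confirm: ")
    return typed.strip() == word
```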
" we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected"
This is analogous to "we needed to fsck, and nobody realized how long that would take".
Rushed to the office at 6am because an important process that needed that host was about to run.
Plugged in keyboard+monitor, dead screen, nothing.
Physically power-cycled server.
Stood in front of monitor+keyboard. It occurred to me it was taking longer than expected to show POST screen. About that time, I got a page saying $ACTUALHOSTNAME is down.
Walk around to the back of the racks. The monitor cable had come detached from the cable extender that I plugged into the server. I had never plugged the monitor in at all, just the extension.
The server wasn't down in the first place, it just lost a virtual interface, which I was paged for, and stupidly tested that virtual interface instead of the REAL name/IP.
And then I raced to the office just so that I could cause an outage.
I once changed a piece of code that was referenced by every page on the customer-facing site (tens of millions of visits a day) to use a new function that someone had previously written (and was called on one page in the site). I mistakenly didn't look too closely at the implementation of the function, and didn't realize how badly its caching strategy was designed. When this code was deployed it instantly caused a thundering herd on our cache servers, bringing the site down for about ~40 seconds.
My worst was that one time I accidentally took down some services that were essential for order processing via an unrelated, large stress test. My test accidentally consumed a large amount of bandwidth saturating the links to a shared service.
I did all this while sitting 2 feet from a print out of "The 8 fallacies of distributed systems". Bandwidth is indeed not infinite, can confirm.
Due to youth, totally misplaced confidence, and a poor access-rights regime, I ran an untested script in production, causing it to fork uncontrollably and requiring a reboot.
TLDR; Someone on the team ran a command by mistake that took everything down. Good, detailed description. It happens. Out of all of Amazon's offerings, I still love S3 the most.
"It happens" is the only reasonable takeaway you can get from a postmortem like this. My worry is that people read it and go "I am aghast that such a command can be run!" without knowing that little commands like that are run numerous times a day without incident.
The only thing I read in there and go "hmmm" is that it took quite that long for the S3 service to recover, and that the status page wasn't hosted on someone that doesn't have an S3 dependency. That's just a plain "doh" moment :)
People need to realize when they go to the cloud it's not that 'it happens', it's that it will happen, and you have no ability to do anything about it. Fact of life and risk management.
... and it's a different risk from self-hosting, but self-hosting provides all sorts of similar issues (such as when you do this to yourself, the cost is now coming out of your pocket, not Amazon's, to employ software engineers to harden your scripts against making the same mistake twice).
Not to mention, Amazon is catching the long tail of cloud failure like Google is catching the long tail of search keywords. They can now say with a somewhat straight face - "You know all those scripts you run to keep everything up? We have figured out many, many more possible ways for them to fail than you probably ever will, and we have added more layers of safeguards than you can even imagine."
AWS partitions its services into isolated regions. This is great for reducing blast radius. Unfortunately, us-east-1 has many times more load than any other region. This means that scaling problems hit us-east-1 before any other region, and affect the largest slice of customers.
The lesson is that partitioning your service into isolated regions is not enough. You need to partition your load evenly, too. I can think of several ways to accomplish this:
1. Adjust pricing to incentivize customers to move load away from overloaded regions. Amazon has historically done the opposite of this by offering cheaper prices in us-east-1.
2. Calculate a good default region for each customer and show that in all documentation, in the AWS console, and in code examples.
3. Provide tools to help customers choose the right region for their service. Example: http://www.cloudping.info/ (shameless plug).
4. Split the large regions into isolated partitions and allocate customers evenly across them. For example, split us-east-1 into 10 different isolated partitions. Each customer is assigned to a particular partition when they create their account. When they use services, they will use the instances of the services from their assigned partition.
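For option 4, the assignment itself could be as simple as a stable hash of the account ID; a sketch, where the partition count and naming are invented:

```python
# Hypothetical cell/partition assignment: each account lands in one of N isolated
# partitions of the region, chosen deterministically at account creation.
import hashlib

NUM_PARTITIONS = 10

def assign_partition(account_id: str, region: str = "us-east-1") -> str:
    digest = hashlib.sha256(account_id.encode()).hexdigest()
    partition = int(digest, 16) % NUM_PARTITIONS
    return f"{region}-cell-{partition}"

print(assign_partition("123456789012"))  # e.g. "us-east-1-cell-7"
```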
So this is the second high profile outage in the last month caused by a simple command line mistake.
> Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
If I had to guess which company could prevent mistakes like this from propagating, it would be AWS. It points to just how easy it is to make these errors. I am sure that the SRE who made this mistake is amazing and competent and just had one bad moment.
While I hope that AWS would be as understanding as Gitlab, I doubt the outcome is the same.
Nothing is going to happen to the engineer that did this, other than embarrassment and probably a couple jokes once sufficient time has passed. Amazon has a strong culture around blaming the process, not the person. The failure here wasn't that the engineer ran the command, it was that the engineer was able to run the command.
Amazon has the wherewithal to not freaking publicly name their actually human employee, so I'd imagine their culture around outages is probably a lot more healthy.
Well to be fair, he named himself within his notes and did not object to the public nature of the disclosure. I agree with your sentiment though that names should not be included within postmortems in the general case.
tl;dr: Engineer fat-fingered a command and shut everything down. Booting it back up took a long time. Then the backlog was huge, so getting back to normal took even longer. We made the command safer, and are gonna make stuff boot faster. Finally, we couldn’t report any of this on the service status dashboard, because we’re idiots, and the dashboard runs on AWS.
Of course. Meant it as more of an "I just spent 10 minutes searching for my car keys while holding them, because I'm an idiot." No disrespect to the engineers.
Overall, it's pretty amazing that the recovery was as fast as it was. Given the throughput of S3 API calls you can imagine the kind of capacity that's needed to do a full stop followed by a full start. Cold-starting a service when it has heavy traffic immediately pouring into it can be a nightmare.
It'd be very interesting to know what kind of tech they use at AWS to throttle or do circuit breaking to allow back-end services like the indexer to come up in a manageable way.
Something that wasn't addressed -- there seems to be an architectural issue with ELB where ELBs with S3 access logs enabled had instances fail ELB health checks, presumably while the S3 API was returning 5XX. My load balancers in us-east-1 without access logs enabled were fine throughout this event. Has there been any word on this?
I think it comes down to how important your ELB logs are -- if they are important enough that you don't want to allow traffic without logs (i.e. if you're using them for some sort of auditing/compliance), then failing when it can't write the logs seems like the right choice.
Thanks, that is a fair perspective. In our case we're using ELB logs as a redundant trace and it isn't critical that our traffic stops if the access logs fail. It would be nice if this behavior became a toggle in ELB settings, but I think we can set something up to disable access logs programmatically if we start seeing S3 issues.
Good luck with this. We tried to make changes yesterday to mitigate the impact, but the AWS console was also affected. We were hesitant to make API calls for the changes since we weren't sure they would complete successfully, given all the services we found actually depend on S3 internally.
Really pleased to see this, it's good to see an organisation that's being transparent (and maybe given us a little peek under the hood of how S3 is architected) and most importantly they seem quite humbled.
It would be easy for an arrogant organisation to fire or negatively impact the person that made the mistake. I hope Amazon don't fall into that trap and instead focus on learning from what happened, closing the book, and moving on.
There are quite a few comments here ignoring the clarity that hindsight is giving them. Apparently the devops engineers commenting here have never fucked up.
On the contrary: I feel like Amazon is taking some flak because everyone here has messed up before, and are surprised that engineers (seemingly lacking failure experience) were able to do what they did.
I wouldn't task a junior sysadmin with a server deletion, would you? Nor could I ever consider someone without a fuckup a senior ;)
This is a bit off topic. The use of the word "playbook" suggests to me that they use Ansible to help manage S3. I wonder if that is the case, or if it's just internal lingo that means "a script". Unless there is some other configuration management system that uses the word playbook that I'm not aware of.
I'm genuinely curious. As my experiments with it have left me disappointed with its performance, I'm just not sure what I could use it for. Store massive amounts of data that is infrequently accessed? Well, unfortunately the upload speed I got to the standard rating one was so abysmal it would take too much time to move the data there; and then I suspect the inverse would be pretty bad as well.
S3 uploads can be very dependent on what the network looks like between your systems and the endpoint. We introduced S3 Transfer Acceleration to help address this. This uses our edge network as your endpoint, then sending your upload across our backbone rather than traversing the commercial internet. It comes with a small fee, but the fee only gets applied if it improves on what the time to transfer would be otherwise.
If you have lots of data that needs to be uploaded (TB/PB worth), then I'd take a look at AWS Snowball. https://aws.amazon.com/snowball/
Also, if you're using the AWS CLI to upload, make sure multi-part upload is enabled.
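Via boto3 (rather than the CLI), the equivalent knob is a TransferConfig; a sketch with illustrative thresholds, where the bucket, key, and file names are placeholders:

```python
# Sketch: large-object upload with multipart enabled and tuned via TransferConfig.
import boto3
from boto3.s3.transfer import TransferConfig

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,   # switch to multipart above 64 MB
    multipart_chunksize=16 * 1024 * 1024,   # 16 MB parts
    max_concurrency=8,                      # parallel part uploads
    use_threads=True,
)

s3 = boto3.client("s3")
s3.upload_file("backup.tar.gz", "my-bucket", "backups/backup.tar.gz", Config=config)
```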
Why use S3? It scales without any user interaction, is highly available (yes, I cringe saying that, but this is the first, and hopefully only time this has occurred! =D ) and extremely easy to access; it's as simple as an HTTP GET. Being able to address objects directly and not have to worry about managing file systems simplifies a lot.
One scenario: if you run a website that has a lot of static content (multiple GB of images, CSS, JS, etc.) and you don't want your HTTP server to be responsible for serving that content, then you give it all to S3 and let them serve it for you.
Performance out of S3 is generally really good. However, if you're looking to say, serve up a global website and your content is in a single S3 region, then you can leverage CloudFront CDN to serve up those objects. CloudFront integrates seamlessly with S3, and you don't pay transfer charges between CloudFront and S3.
Database backups: with the upload speeds I've seen, completing a backup of a database with hundreds of GB would take a really long time. I don't want that extra load and keeping that connection open forever.
> (...) [W]e have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.
All those tweets saying "turn it off and back on again"?
"We accidentally turned it off, but it hasn't been turned it off for so long it took us hours to figure out how to turn it back on."
Poorly-presented jokes aside, this is rather concerning. The indexer and placement systems are SPOFs!! I mean, I'd presume these subsystems had ultra-low-latency hot failover, but this says they never restarted, and I wonder if AWS didn't simply invest a ton of magic pixie dust in making Absolutely Totally Sure™ the subsystems physically, literally never crashed in years. Impressive engineering but also very scary.
At least they've restarted it now.
And I'm guessing the current hires now know a lot about the indexer and placer, which won't do any harm to the sharding effort (I presume this'll be being sharded quicksmart).
I wonder if all the approval guys just photocopied their signatures onto a run of blank forms, heheh.
I don't think you understand the architecture of the system if you are describing the indexer as a SPOF.
The system is a collection of shards. If you replicate it to create a second copy, then you'll just have as large a system, which is still a single point of failure.
The index, by necessity, has to be able to answer the question 'this object exists' or 'this object doesn't exist' - so it needs to have consensus.
My speculative presumption was going off the sole datapoint of "we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years". I'm not quite sure how to interpret "restart" in this context, mostly due to lack of exposure or experience.
The report also says "Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems." So you're right, it looks like multiple servers were supporting these systems, which does make sense (especially considering the load they would have seen). Okay.
I guess I didn't quite think through the load requirements and thought these were single machines - which is certainly ludicrous thinking :) - and that's where I got the SPOF reasoning from.
You're very right though, these consensus systems must be built as bottlenecks in order to see everything.
And there aren't really any alternatives: "build extra indexers and placement systems!" just gives you "but what if _all_ of them get taken offline?" and "it can't leave the datacenter, it sees 100GB/s of throughput" (number taken out of thin air).
CEOs all over the world just realized that they can't depend only on S3, and they might have to double up on their infrastructure and have a parallel environment on Azure or Google as well.
I keep being reminded of something I read recently that made me feel uneasy about google's cloud spanner [1]:
> the most important one is that Spanner runs on Google’s private network. Unlike most wide-area networks, and especially the public internet, Google controls the entire network and thus can ensure redundancy of hardware and paths, and can also control upgrades and operations in general. Fibers will still be cut, and equipment will fail, but the overall system remains quite robust.
> It also took years of operational improvements to get to this point. For much of the last decade, Google has improved its redundancy, its fault containment and, above all, its processes for evolution. We found that the network contributed less than 10% of Spanner’s already rare outages.
I am unpleasantly surprised that they do not mention why services that should be unrelated to S3 such as SES were impacted as well and what they are doing to reduce such dependencies.
From a software development perspective, it makes sense to reuse S3 and rely on it internally if you need object storage, but from an ops perspective, it means that S3 is now a single point of failure and that SES's reliability will always be capped by S3's reliability. From a customer perspective, the hard dependency between SES and S3 is not obvious and is disappointing.
The whole internet was talking about S3 when the AWS status dashboard did not show any outage, but very few people mentioned other services such as SES. Next time we encounter errors with SES, should we check for hints of S3 outage before everything else? Should we also check for EC2 outage?
> services that should be unrelated to S3 such as SES were impacted
I don't think this is particularly surprising. I'd already pretty much assumed that, e.g., a package of code for a Lambda function would be housed in an S3 bucket somewhere.
What's really surprising to me is how many of those buckets appear to live in US-EAST-1, and aren't able to keep functioning in a catastrophe by failing over to a different region.
I don't think it's automatic. I just helped my former boss with his decision to go for a refund (he asked me for help drafting a request, but I reminded him that like 99.99% of their S3 storage is backups that are IA-Standard, so it may not be worth it).
Meh, that's a process problem, not a people problem. Playbooks that have you retype commands with complex options, with no confirmation, etc, are inviting that sort of thing.
The wording of the article implies Amazon is shifting the blame entirely on the individual who typo'd: they indemnify themselves with "an authorized S3 team member using an established playbook..." ("don't blame us, our process is perfect!")
There are process-fixes for this, such as requiring a two-person rule when at a production shell and modifying tooling to detect potentially unintentional commands (e.g. a SQL UPDATE without a WHERE) - but given what I know about Amazon's internal practices (i.e. the brutality) it wouldn't surprise me if they did terminate the unfortunate operator - not because they want to, but because AWS simply has too many large-scale customers who would demand immediate action like that.
It's both. This isn't "our system was compromised by attack;" it's "SNAFU."
Everyone who's had operations experience knows that there will be, as time approaches infinity, more than zero SNAFU. That's why companies offer five nines of uptime, not 100% uptime.
I brought down our production system after a typo in a command once... the dev team took the blame for allowing an illegal parameter to bring down the system.
It was a very well-run engineering department where taking blame was not a career ending decision. I took full blame for the typo (at 3am trying to resolve a customer issue), but the dev team accepted full responsibility for letting it take down the system.
Every mistake was used as a learning opportunity to ensure that the same and similar mistakes can't be repeated.
He took off in his piston engine plane, only to lose power during the climb and was forced to make a crash landing. It turned out the airplane was fueled with jet fuel instead of regular gasoline (the ground crewman mistakenly thought the plane was a turbo prop).
Instead of yelling at or firing the ground crewman, Hoover had this to say[2]:
"There isn't a man alive who hasn't made a mistake.
But I'm positive you'll never make this mistake again.
That's why I want to make sure that you're the only one
to refuel my plane tomorrow. I won't let anyone else
on the field touch it."
On a more serious note, if you've never done something like this, you haven't had enough interesting projects.
I've had a decent career and I still managed to:
* re-deploy the current application version in all our data centers, instead of the new version, in a period when our deployment wasn't a 0-downtime one
* rename all the Jenkins jobs on the server to the same name, thus deleting hundreds of Jenkins jobs in one fell swoop
"Let him who is without sin cast the first stone" and all that :)
There are two things I've come to believe in my IT career regarding operations:
1. No organization anywhere is a paragon of excellence, and everyone can benefit from improvement.
2. Every organization is made up of humans just like you. With all that entails.
Some things which seem blatantly obvious after the fact are easily overlooked when the pressure to deliver is high and other issues are taking precedence.
I completely agree with your statement. In fact, when I do interviews, one of my favorite and most insightful questions to ask is, basically, "tell me about a time you screwed the pooch." If they don't have a story and they worked in ops, then it can suggest they didn't really do much. The really sharp ones I've interviewed have a good story or two (and can tell it in excruciating detail. =)
* At a prior company I once tried appending to the list of NFS exports, but dropped the "no-root-squash" option, and instantly denied write permissions to our entire VMware farm. You can imagine what then happened to all of the VMs for this mission critical customer. =P
As I keep saying, there are people who screw up big time and people who are too scared to touch the system. I managed to gun down three productive clusters by deploying NTP by accident along with a tiny change. Kerpow, 12 minutes of downtime, full network outage due to DHCP and such. Great fun.
"Seventy-three? Wow, I hadn't realized our system grew that much. Probably a new backend dependency got added that I'm not familiar with yet; I'll look into it later." (Y)
Every sysadmin at my previous job (a Fortune 500) would stop the moment that number was off from the expected by a few machines, just long enough to verify it's correct. That may be due to having made a mistake like this once. I know that's true in several other large shops, as well.
Source: I was the one they would call for our team... usually at 4:00 AM because one of our team members (who was also frequently me) didn't document something correctly.
If Amazon were a guy, he'd be a standup guy. This is a very detailed and responsible explanation. S3 has revolutionized my businesses and I love that service to no end. These problems happen very rarely, but I may add backups just in case, using an nginx proxy approach, at some point; because S3 is so good, everyone seems to adopt its API, so it's just a matter of a switch statement. Werner can sweat less. Props.
I would add, it would be awesome if there were a simulation environment, beyond just a test environment, that simulated outside servers making requests before a command was allowed to run on production; like a robot deciding this, it could mitigate issues like this, kind of like TDD on steroids, if they don't have that already.
I can imagine being that guy in that exact moment. But I can't imagine being that guy after the event. There will be a constant fear and doubt in my mind. And a constant fear whether others trust me anymore. I couldn't quit because that might make me look bad and I couldn't continue because that might make me look bad.
Twitter once had 2 hours of downtime because an operations engineer accidentally asked a tool to restart all memcached servers instead of a certain server. The tool was then changed to make sure that you couldn't restart more than a few servers without additional confirmation. Sounds very similar to this situation. Something to think about when you are building your tools to be more error proof.
> Removing a significant portion of the capacity caused each of these systems to require a full restart.
I'd be interested to understand why a cold restart was needed in the first place. That seems like kind of a big deal. I can understand many reasons why it might be necessary, but that seems like one of the issues that's important to address.
Possibly a consensus algorithm that refuses writes when it detects itself in a minority, because it thinks it's in the smaller part of a split-brain scenario.
In this case, throwing away and then re-provisioning the split-off nodes is a viable approach.
It sounds like this can be mitigated by making sure everything is run in dry run mode first, and for something mission critical, getting it double-checked by someone before removing the dry run constraint.
It's good practice in general, and I'm kind of astonished it's not part of the operational procedures in AWS, as this would have quickly been caught and fixed before ever going out to production.
I'm not sure you've understood what the problem is. They were removing some servers from a group. This isn't something that gets a dry run, not without spinning up the entire AWS infrastructure. It also wouldn't have helped a jot, since the issue came about after an employee executing a playbook made a typo.
There's no way this could have been mitigated with a dry run. They're mitigating it in future by putting more aggressive safeguards in their tooling, which is the correct way to mitigate this sort of issue.
"As a result, (personal experience and anecdotal evidence suggest that) for complex continuously available systems, Operations Error tends to be the weakest link in the uptime chain."
I guess it is time to define commands whose inputs have great distance in, say, the Damerau-Levenshtein metric.
For numerical inputs, one might use both the digits and the textual expression. This would make them quite cumbersome but much less prone to errors. Or devise some shorthand for them...
156 (on fi six). 35. (zer th fi). 170 (on se zer). 28 (two eig)
Evens have three letters, odds have two.
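A sketch of the edit-distance idea: a tool could refuse (or demand extra confirmation for) an argument that sits within a small restricted Damerau-Levenshtein (optimal string alignment) distance of a far more dangerous valid value. The threshold is arbitrary.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

# "10" and "100" are a single edit apart, which is exactly the kind of closeness
# a tool might flag before acting.
print(osa_distance("10", "100"))    # 1
print(osa_distance("10", "10000"))  # 3
```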
This reminds me of Asimov's characteristically tiny story "Fault-Intolerant" https://unotices.com/book.php?id=38686&page=15 (You can ignore the story at the top about Feghoot, the real story is below.)
Wonder if every number for critical command lines shouldn't be spelled out as well. If you think about how checks work, you're supposed to write the number as well as the words for the number.
-nbs two_hundreds instead of twenty
is much less likely to happen..
> While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.
This is the bit that'd worry me most; you'd think they'd be testing this.
A complete restart of the index subsystem would require downtime. Note: they are not saying those servers have never been restarted - it's highly likely they get restarted regularly. But, a complete restart of the index subsystem implies that you shut everything down first and restart it all at once, which is what was forced to happen two days ago.
That's kind of like asking why git doesn't have a single backup... it's a distributed system, there's not just one backup, there are lots of little partial backups.
This caused panic and chaos for a bit among my team, which I imagine was replicated across the web.
Moments like these always remind me that a particularly clever or nefarious set of individuals could shut down essential parts of the Internet with a few surgical incisions.
Seems like something like Chaos Monkey should have been able to predict and mitigate an issue like this. I'm actually curious if anyone uses it at all. Is anyone here at a large company (over 500 employees) that has it deployed?
I think they should have led with insensitivity about it and maybe a white lie. Such as... We took our main region us-east-1 down for X hours because we wanted to remind people they need to design for failure of a region :-)
Shameless plugs (authored months ago):
http://tuxlabs.com/?p=380 - How To: Maximize Availability Efficiently Using AWS Availability Zones (note: read it, it's not just about AZs; it is very clear about multi-region and, better yet, segues into multi-cloud in the second article)
http://tuxlabs.com/?p=430 - AWS, Google Cloud, Azure and the singularity of the future Internet
This makes me want to write a program that would ask users to confirm commands if it thinks they are running a known playbook and deviating from it. Does anyone know if a tool like that exists?
Not sure, but my company's fleet-wide root scripts first confirm the exact command you want to run, then run on one host and output the full logs for you to inspect and confirm, and then finally start the full fleet-wide run after you have confirmed the expected result of your output. They also output the full logs from across the entire fleet once your fleet-wide script is run.
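A sketch of that staged pattern; run_on_host and run_on_fleet are hypothetical helpers, not a real library:

```python
# Sketch of a staged fleet-wide run: echo the exact command, canary on one host,
# show its output, then require explicit confirmation before the full fleet.
def staged_fleet_run(command: str, hosts: list[str], run_on_host, run_on_fleet):
    print(f"About to run across {len(hosts)} hosts:\n  {command}")
    if input("Proceed with a single-host canary? [y/N] ").lower() != "y":
        return None

    canary_output = run_on_host(hosts[0], command)
    print(f"--- canary output from {hosts[0]} ---\n{canary_output}")

    if input("Canary looks good. Run on the full fleet? [y/N] ").lower() != "y":
        return None
    return run_on_fleet(hosts, command)
```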
For as much as people jumped all over Gitlab last month, this seems remarkably similar in terms of preparedness for accidental and unanticipated failure.
Beyond a typing mistake, it's not really very similar. The Gitlab incident was one avoidable problem after another, ending with a giant WTF when they found out that no-one had even tested the backups were working.
This is a case of someone slipping on the keyboard, removing more capacity than intended and the recovery process taking longer than expected. The process actually seems to be working (to a given value of working), but the amount of downtime was way above acceptable. They've already put more safeguards into the tooling to prevent the situation from happening again.
S3 is also orders of magnitude more complex than GitLab's infrastructure, so while the amount of time the outage lasted for is not acceptable, it does show that they at least have working processes for critical situations that allow them to get back in service within a day, which is pretty impressive.
I assume "reboot" in this instance means more than turning it off and on again--it must return to a working state, with many volumes of data requiring log processing to find the last (and best) "good state".
I don't think those would only have a mere 1 MB of data ;) Even considering RAM speeds, stuff scales in surprising ways when a third of the Internet relies on you. (Right, right, exaggerating - this was one AZ only; but you catch the drift, I assume)
No one on HN is questioning this - "The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected." - they were debugging on a production system...
What most AWS customers don't realize is that AWS is poorly automated. Their reliability relies on exploiting the employees to manually operate the systems. The technical bar at Amazon is incredibly low and they can't retain any good engineers.
What's missing is addressing the problems with their status page system, and how we all had to use Hacker News and other sources to confirm that US East was borked.
> We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions.
"From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3. Instead, we used the AWS Twitter feed (@AWSCloud) and SHD banner text to communicate status until we were able to update the individual services’ status on the SHD. We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions."
For the many of us who have built businesses dependent on S3, is anyone else surprised at a few assumptions embedded here?
* "authorized S3 team member" -- how did this team member acquire these elevated privs?
* Running playbooks is done by one member without a second set of eyes or approval?
* "we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years"
The good news:
* "The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately."
The truly embarrassing part, which everyone has known about for years, is the status page:
* "we were unable to update the individual services’ status on the AWS Service Health Dashboard "
When there is a wildly-popular Chrome plugin to fix your page ("Real AWS Status") you would think a company as responsive as AWS would have fixed this years ago.