> At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
It remains amazing to me that even with all the layers of automation, the root cause of most serious deployment problems remains some variant of a fat-fingered user.
Look at the language used though. This is saying very loudly "Look, this isn't the engineer's fault here". It's one thing I miss about Amazon's culture: not blaming people when systems fail.
The follow-up doesn't bullshit with "extra training to make sure no one does this again", it says (effectively) "we're going to make this impossible to happen again, even if someone makes a mistake".
Any time I see "we're going to train everyone better" or "we're going to fire the guy who did it", all I can read is "this will happen again". You can't actually solve the problem of user error with training, and it's good to see Amazon not playing that game.
What bothered me about running TrueTech is that customers would sometimes demand repercussions against employees for making mistakes.
Enter Frans Plugge. Whenever a customer would get into that mode we'd fire Frans. This was easy, simply because he didn't exist in the first place (his name was pulled from a skit by two Dutch comedians, bonus points if you know who and which skit).
This usually caused the customer to backtrack, insisting he/she never meant for anybody to get fired...
It was a funny solution and we got away with it for years, for one because it was pretty rare to get customers that mad to begin with and for another because Frans never wrote any blog posts about it ;)
But I was always waiting for that call from the labor board asking why we fired someone for whom there was no record of employment.
Is it unreasonable for me to think that company owners should have the spine to say, "We take the decision to fire someone very seriously. We'll take your comments under consideration, but we retain sole discretion over such decisions"?
It irks me that businesses fire people because of pressure from clients or social media. But having never been the boss, I may be missing something.
One reason to like a facet of Japanese management culture: if a customer wants someone raked over the coals, you offer up management, not employees.
Internal repercussions notwithstanding, externally the company is a united front. It cannot excuse mistakes by luck, accident, or happenstance, because the world includes luck, accidents, and happenstance, so any user-visible error is ipso facto a failure of management.
Apparently this is (or was) a job in Japan: companies would hire what amounts to an actor to get screamed at by the angry customer and pretend to get fired on the spot. Rinse, repeat whenever such appeasement is required.
I know one person who does this for real estate developers. He gets involved in contentious projects early on, goes to community meetings, offers testimony before the city council, etc. When construction gets going and people inevitably get pissed about some aspect of the project, he gets publicly fired to deflect the blame while the project moves on. Have seen it happen on three different projects in two cities now and, somehow, nobody catches on.
I don't know how to describe this in a single word or phrase, but I think the situation itself is a "genius problem." Not a genius solution. The problem itself is impressive and rich in layers of human nature, local culture, etc. - but once you have such a problem, any average person could come up with a similar solution, because it is obvious.
It's still mind blowing and very amusing that this is a thing in our world!
Do you have a citation for that? I am curious; it's something I've never heard of and goes against my intuitions/experience regarding what traditionally managed Japanese companies would do. (Entirely possible it has happened! Hence the cite request.)
Imagine if the customer saw the same actor getting fired in different companies! Is the customer going to catch on? More likely, they will think "Yeah, no wonder there was a problem. This same incompetent dude wormed his way into this company too" :-)
There's a movie sketch in here somewhere. A guy has the worst day of his life, every single thing goes wrong, and at every single company the same person is "responsible" for the issue.
Historically, some cultures practiced mock firing as a way to appease an angry customer. This was back in the day when most business transactions occurred face to face, so the owner would demand that the employee pack their belongings and leave the premises in full view of the customer. Of course this was all for show, but this kind of public humiliation seemed to satisfy even the most difficult customers.
Even when the customer knew it was for show, I see it as a way of saying, "yes, we acknowledge that we screwed up and make a public, highly visible note of it that will be recorded in the annals of peoples' gossip in this area".
To what lengths did you keep that up? Did you just tell the client informally, formally in a meeting, or did you actually put it in writing or even fake some "firing" paperwork?
This is true in some cases, but not when mitigations aren't practiced properly - it's not the fat-fingered user who should be fired or retrained, but the designer or maintainer of the system that allowed a slip to become a serious issue.
Look at the recent GitLab incident - one guy messed up and nuked a server. Okay, that happens sometimes, go to backups. Uh oh, all the backups are broken. Minor momentary problem just turned into a major multi-day one.
That's a problem and one which could be preventable with training (or, arguably, firing and hiring). Maintaining your backups properly should be someone's duty, designing and testing systems to minimize impact of user error should be too.
Fair enough. I guess what I meant was specifically using training or punishment to combat "momentary lapse" issues.
If someone doesn't test their backups, you train them to test backups. If someone lies about testing the backups, maybe you fire them. But if someone trips and shatters the only backup disk, you don't yell at them - you create backups that an instant of clumsiness can't ruin.
I did overstate, training is perfectly reasonable, but I often see it cited exactly when it shouldn't be, as a solution to errors like typos or forgetfulness.
Training people to test backups is still a bad idea: why make someone do a job that is purely verification? Those jobs eventually stop getting done, and it's hard to keep people doing them.
Instead, you make a machine verify the backups simply by using the backups all the time. For example, at work I feed part of our data pipeline with backups: Those processes have no access to the live data. If the backups break, those processes would provide bad information to the users, and people would come complaining in a matter of minutes.
Just like when you have a set of backup servers, you don't leave them collecting dust, or tell someone to go look at them every once in a while: you just route 1% of the traffic through them. They are still extra capacity, you can still do all kinds of things to them without too much trouble, but you know they are always working.
Never, ever, force people to do things they don't gain anything from. Their discipline would fade, just like it fades when you force them to a project management tool they get no value from.
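A minimal sketch of the "route a small share of traffic through the standby/backup pool so it is always being exercised" idea above, in Python; the pool names and the 1% share are placeholders:

```python
import random
from collections import Counter

# Hypothetical pools: names and weights are illustrative only.
PRIMARY_POOL = ["app-01", "app-02", "app-03"]
STANDBY_POOL = ["standby-01", "standby-02"]  # spare / restored-from-backup capacity

STANDBY_SHARE = 0.01  # ~1% of requests exercise the standby pool continuously


def pick_server() -> str:
    """Route most traffic to the primary pool, but keep the standbys warm.

    If the standby pool (or the backups feeding it) is broken, that 1% of
    requests starts failing within minutes and someone notices, instead of
    the breakage being discovered during a disaster.
    """
    pool = STANDBY_POOL if random.random() < STANDBY_SHARE else PRIMARY_POOL
    return random.choice(pool)


if __name__ == "__main__":
    print(Counter(pick_server() for _ in range(100_000)))
```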
No. You don't make a daily task of testing backups. That would be wrong for precisely the reasons you cite. It's a waste of effort and time, and ignores what the point of testing them is for: ensuring that the procedure still works.
One would only actually test the backups about twice a year just to be damn sure they are still resulting in restorable data. The rest of the year it's only worth keeping an automated process reporting whether or not the things are being made, and people keeping an eye on change management to be sure no changes are made to the known-to-be working process that can break it without the new process incurring an explicit vetting cycle. Gitlab wasn't apparently testing or engaging in monitoring what was supposed to be an automated process. That's where they got burned.
Process monitoring may be boring as hell, but it's seldom wasted effort, and will prevent massive, compounded headaches from bringing operations to a chaotic halt.
I agree. I have the same policy when setting up servers: don't have a "primary" and a "backup" server, make both servers production servers and have the code that uses them alternate between them, pick a random one, whatever. (I don't always get to implement this policy, of course.)
This makes some sense, but I don't think it negates testing backups? For duplicate live data, yeah, you can't just use both. But most businesses have at least some things backed up to cold storage, and that still needs to be popped in a tape deck (or whatever's relevant) and verified.
I don't buy the "if we just plan ENOUGH, disasters will never occur" argument. The universe is just too darn interesting for us to be able to plan enough to prevent it from being interesting.
It is all about hazard/risk modeling and mitigation. E.g.
Someone rm -rf / ing the server will happen eventually with near 100% certainty in any company and can be mitigated by tested, regular, multiply redundant backups.
Cosmic rays flipping bits will happen with near 100% probability at the scale someone like Amazon works at, and can be mitigated by redundant copies and filesystems with checksum-style checks. Similar with hard drive failure.
Earthquakes will happen in some areas with near certainty over the time periods companies like Amazon presumably hope to be in business and could be mitigated by having multiple datacenters and well constructed buildings. Similar for 'normal' scale volcanoes.
Fires will happen but they can be mitigated (with appropriate buildings and redundancy).
Small meteorite strikes are unlikely but can be mitigated by redundancy.
Solar activity causing an electromagnetic storm - yeah, one can shield one's datacenter in a Faraday cage, but in this situation the whole world is probably in chaos and one's datacenter will be the least of one's concerns (unless shielding becomes standard, in which case you'd better be doing it). Similar applies for nuclear war, super volcanoes, massive meteorite strikes, or other global events at the interesting end of the scale.
But yeah, there are going to be things that get missed. The key is having an organization that (1) learns from its mistakes and (2) learns from others' mistakes, and continually keeps its risk modeling and mitigation measures up to date. And note that many of the hazards that are worth mitigating share the same mitigation, i.e. redundancy (at different scales).
Giving people training in response to things like this always seemed a little strange to me - that particular person just got the most effective training the world has ever seen. If you look at it that way, you could say that everything Amazon spent responding to this was actually a training expense for this particular person and team. After you've already done that, it seems silly to make them sit through some online quiz or PowerPoint by a supposed guru and think you're accomplishing anything.
Yeah indeed. You know who the one person at Amazon is that I'd expect to never fat finger a sensitive command ever ever again? The guy who managed to fat finger S3 on Tuesday. Firing him over this mistake is worse than pointless, it offers absolution to every other developer and system that helped cause this event.
I wouldn't merely call this a fat finger or typo - it's quite possible that the tool itself was so error-prone to use that mistakes were impossible to avoid, given the complexity of its inputs.
Based on Amazon's decision to improve the tooling such that this category of error would be (hopefully) impossible to reproduce, I would lean more towards that being the case.
I think the value of making these mistakes is in learning from them and then making sure they can't happen again. Leaving this process in place and just making this guy run the command forever because he screwed it up once would be a much less effective solution than fixing the tooling so it's impossible to do this in the first place. Telling this one guy "don't do it again" also offers absolution to everyone else on the team. In a healthy culture, only "we" can fail.
Is it not true for you? I know that I'm personally good at avoiding the same mistake. I'm also extraordinarily good at avoiding repeating catastrophic mistakes. I generally change my processes in the same way that Amazon is changing their processes to avoid this mistake.
I am not talking about what Amazon is doing, but about the notion that the individual won't make the same mistake again, which is what the grandparent is getting at.
He won't make the same mistake because no one makes the same big mistake twice? I wouldn't bank on that alone.
Years ago I read a story about a fat-fingered ops person getting called into the CEO's office after an outage. "I thought you were calling me in to fire me." "I can't afford to fire you, today I spent a million dollars training you."
Which is really the point of automation and configuration management. When a manager asks you, "How are you going to prevent this in the future?" You can say, "We added a check so n must be less than x% of the total number of cluster members," or "We added additional unit tests for the missing area of coverage" or "We added new integration tests that will pick up on this."
Tests and configuration scripts don't prevent all breakage. But when you have them, you can say, "We missed that, let's add it," or "That failed, but it's a false positive. Let's add this edge case to this test."
If you have no automation, tests or auditing systems around running deployments, you can't do any of this.
I agree testing and automation are good. I think they need to go beyond this to formal verification, for something on this scale and reliability. NASA doesn't make these sorts of mistakes.
By the way - this is not just Amazon's problem now. We know the internet has a single point of failure. So does a lot of IoT.
(Specifically https://www.youtube.com/watch?v=6OalIW1yL-k#t=3m but it's worth watching the whole clip (or even the whole movie) if you haven't seen it before. It's from Terry Gilliam's "Brazil".)
>We know the internet has a single point of failure.
It has? I have yet to see the day where I can neither reach my email provider nor Google nor Hacker News. My local provider might screw up occasionally, or some number of websites go unreachable for whatever reason. But I fail to come up with anything short of cutting multiple sea cables that causes more than 50% of servers to be unreachable to more than 50% of users.
Amazon do formally verify AWS (they use TLA+), which is probably why this failure is a human error. Of course, you could expand the formal analysis of the system to include all possible operator interactions, but you'll need to draw the line at some point. NASA certainly makes human errors that result in catastrophic failures. The Challenger disaster was also a result of human error to a large degree[1]; to quote Wikipedia: "The Rogers Commission found NASA's organizational culture and decision-making processes had been key contributing factors to the accident, with the agency violating its own safety rules."
This is the major basis of the CMM Levels [1]. At higher levels of maturity and necessity, systems and processes are designed to increasingly prevent errors from reaching a production environment.
Amazon is taking the right approach here. The fact that a system as complex and important as S3 can be taken down is a failure of the system, not the person who took it down accidentally.
A lot of the IT vendors I have worked with were CMM/CMMI level 5, but the crappiness of their development, process, and deployment work makes me wonder whether all their effort goes into attaining those certifications as opposed to doing something better.
As someone who worked for an IT vendor with certification and as someone who was part of the certification team at another place, I can assure you that you're right.
The certification is more for the organization/unit, and the people doing the work don't realize what it is for. Another thing that usually becomes a problem is the rigidity of the certification. Saying you need X, Y and Z documented is easy, but it doesn't work for projects that maybe don't have Y. So people make up documentation and process just to be compliant, and this soon becomes a hindrance to the work.
At this point people either abandon the process or follow it and the work suffers.
Thank you for adding this comment. I am glad there are more people out there that aren't afraid to be honest about some of the nonsense 'follow the process no matter what' stuff that I have experienced over the years.
CMM level 5 ==> You have a well-documented, repeatable, and still horrible process that declares all errors statistically uncommon by "augmenting" the root cause with random factors. Insta-certification.
I've had the privilege of either working for myself, the company that acquired mine and let me run the dev, or at Google. From that perspective, and what I understand about ops, the rarity is not having the attitude mentioned in the parent.
This is good. And for the software engineers, great.
I've heard from people doing the grunt work at Amazon -- warehouse staff -- that Amazon incentivises employees to rat out each other for mishandling, lateness, etc., fostering intense competition.
I spent time in the fulfillment centers, writing software for them. I definitely didn't see that sort of thing. There's no need - the software tracked everything they did. Low performers would be found and retrained or "promoted to customer" without the need for anyone to "rat out".
Plus, managing humans in a 'rat out' system would be incredibly inefficient. Now you need lots of employees just to listen to the ratting!
Agreed, especially regarding the culture but isn't this pretty much the same explanation they gave a few years ago when something similar happened?
I seem to recall an EC2 or S3 outage a few years ago that boiled down to an engineer pushing out a patch that broke an entire region when it was supposed to be a phased deployment.
I could be mis-remembering that but it's important that these lessons be applied across the whole company (at least AWS) so it would be a bigger mark against AWS if this is a result of similar tooling to what caused a previous outage.
I believe Jeff once said something along the lines of "why would I fire an employee that made an honest mistake? I just spent a bunch of money teaching him a lesson"
The linked article also says the tools they use were changed to limit the amount of resources that could be taken down at a single time, the speed they could be taken down at, and a hard floor was put on the number of instances that could be stopped.
That's a lot more than just extra training, and a lot better than a two-key system.
> Those in the U.S. that had been fitted with the devices, such as the ones in the Minuteman silos, were installed under the close scrutiny of Robert McNamara, JFK's Secretary of Defense. However, the Strategic Air Command greatly resented McNamara's presence, and almost as soon as he left, the code to launch the missiles, all 50 of them, was set to 00000000.
> Oh, and in case you actually did forget the code, it was handily written down on a checklist handed out to the soldiers.
I think you have it backwards. The post does not say they will simply be training the problem away. They are putting safeguards into their tooling to prevent the case of a fat finger.
The article leaves little doubt that they didn't know such an event would be so hard to recover from. They knew it wouldn't be easy, but they were surprised by how bad it was.
I've long said something like "To err is human. To fuck up a million times in a second you need a computer."
I may have to upgrade that to take the mighty power of Cloud (TM) into account, though. Billions and trillions of fuck ups per second are now well within reach!
Yes, it is. I believe I added the concept of "fuckups per second", but my memory being what it is and the general creativity of the internet being what it is, I would not be surprised that it either wasn't original or I wasn't the first.
We may think that an automated system requires less understanding in order to operate it. But from the other point of view, you have to know what you are doing; the consequences of even a small change are big.
This is one of the things that happens with Windows: standing up a server is so easy that people believe they don't have to understand what's under the hood, and then we get a lot of misconfiguration and operational issues.
It's one of the reasons that silly guarantees like "eleven 9s of reliability" are meaningless. There are humans here. "Accidental human mishap" is gonna happen sometimes, and when it happens it's probably gonna affect a lot of data. Heck, at around 7 or 8 nines you have to account for the possibility that your operations team will decide that all your data is a vicious pack of timberwolves and needs to be defeated.
Note that's durability not reliability. You might not be able to get at it with every request (I think 99.99% is the target) but it'll still be there if you try again later.
But Amazon doesn't offer eleven 9s of availability. I don't think anybody serious does, so arguing about how silly eleven 9s of availability would be is kind of pointless. The SLA is only four 9s of availability.
Note: they say "S3 is DESIGNED for 11 9s of durability". It's PR-speak to say that they don't give you any guarantee, but in theory the system is designed in a magnificent way.
Eleven 9s of durability is about the likelihood of AWS losing your data. It doesn't cover the likelihood of you being able to access your data; that's called availability.
For example, on GCS (Google's S3)... a storage class specifies in how many locations the data is made available. All storage classes share the same durability (chance of Google losing your data) of 99.999999999%, but have different availability (chance of being able to retrieve the data).
It says that their ideal-case failure rate is 11 nines; that's how much you should lose to known, lasting issues like machines failing and cutting over.
Amazon's actual SLA offers 2 nines and 3 nines as the credit thresholds. So they're stating the reliability of their known system, and the rest is for events like this.
Durability and uptime are not the same thing. Durability is about the chance of losing your data and has nothing to do with service disruptions. Their uptime SLA is much lower. Looking at [1], it looks like the SLA says 3 9s (discounts given for anything lower) of uptime.
As I understand it, those guarantees don't mean that the service will actually stay up for the given number of 9s; it's that you'll be reimbursed monetarily if and when they go down.
Kinda the same thing, though. I mean, from my perspective there's no substantive difference between me saying "this service will stay up 99.xx% of the time" and me buying insurance to pay you for the 0.xx% of the time I might fail.
The alternative is that I use the insurance to pay my legal fees when you sue me for not meeting my uptime guarantees.
It's not the same thing. The Amazon service might only be costing you $100/mo, but if it goes down the cost to your business might be millions. They'll reimburse you the $100, not the millions.
Yeah, as soon as I read this I felt bad for the employee. I remember writing an UPDATE statement without a WHERE clause and having to restore the table from backup. But that was at a company not as advanced as Amazon. Fat-fingering a command like that is just crazy (but comforting that even at Amazon it happens), and I'm sure they've made sure it can't happen again.
FWIW: Setting "safe-updates=1" in ~/.my.cnf will require UPDATE and DELETE statements in the client to have a WHERE clause which references a key. It's not perfect protection, but it will save you from a lot of mistakes.
I once ran: DELETE FROM table WHERE [long condition that resolved to true for all records]
Now I write
SELECT or SELECT COUNT(*) over and over again until I see the data I expect, and then change it to a DELETE/UPDATE.
It's not my personal habit, but some folks I know turn off autocommit and BEGIN a transaction every time they enter an interactive SQL session. They then default to ROLLBACK at least once before COMMITting.
That and having a user with read-only permissions or a read replica
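A rough sketch of the "dry-run count plus transaction" habit in programmatic form, using Python's built-in sqlite3; the table and the expected row count are invented for the example:

```python
import sqlite3

# Autocommit mode (isolation_level=None) so we control BEGIN/COMMIT ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, active INTEGER)")
conn.executemany("INSERT INTO users (active) VALUES (?)",
                 [(1,)] * 95 + [(0,)] * 5)

EXPECTED_ROWS = 5  # what a SELECT COUNT(*) dry run said we should touch

cur = conn.cursor()
cur.execute("BEGIN")
try:
    cur.execute("UPDATE users SET active = 1 WHERE active = 0")
    if cur.rowcount != EXPECTED_ROWS:
        # The WHERE clause matched a different number of rows than the dry
        # run predicted: bail out instead of committing a surprise.
        raise RuntimeError(f"expected {EXPECTED_ROWS} rows, got {cur.rowcount}")
    cur.execute("COMMIT")
except Exception:
    cur.execute("ROLLBACK")
    raise
```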
Hmm, that's kinda cool. I'm in a MS shop and I don't know if SSMS has the same feature. My manager just looked at me and said "welp, go restore the table and be more careful next time." I was a new DBA at the time - still kinda new, really.
I once brought down our entire production XenServer cluster group by issuing a "shutdown now" in the wrong SSH window. Needless to say it was a bad feeling watching Nagios go crazy and realizing what had just happened.
root@baz # shutdown now
W: molly-guard: SSH session detected!
Please type in hostname of the machine to shutdown: foo
Good thing I asked; I won't shutdown baz ...
Surprising to see such a simple protection neglected.
Oh crap! Yeah I bet you were pretty panicked. My update statement destroyed data that my team used all the time so I was worried I'd get fired. Luckily that wasn't the case.
When I first started using Linux and wanted to do some housecleaning, I did "rm -r *" in a folder. Cleaned up everything, no prob. Then went to some more folders, hit the up arrow on my keyboard fast to get to a command I had used before. Hit 'enter' before my brain realized I had landed on "rm -r *" and not the right command. Never used that command again.
Automation tends to make those kinds of errors worse rather than better. Perhaps more infrequent and of a different nature than before, but screwing up an automated action cascades much, much faster than a human initiated one. As a result, you have to watch things a good deal closer and build in more and tighter safe guards.
I've heard (and sometimes pushed) this rhetoric before, but something should be well understood before it's automated. Things that happen very rarely should be backed with a playbook + well exercised general monitoring and tools. This puts human discretion in front of the tools' use and makes sure ops is watching for any secondary effects. Ops grimoires can gather disparate one-offs into common and tested tools, but they don't do anything to consolidate the reasons the tools might be needed.
To me that sounds like development and testing (i.e. figuring out what the steps are). Once you have that it should be automated fully.
Too often people will put up with the "well, we only do this once a month so it's not worth automating". Literally, I script everything now, just in simple bash... if I type a command, I stick it into a script, and then run the script. Over time you go back and modify said script to be better, and eventually this turns into a more substantive application. At a certain point, around the time that you have more than one loop or are trying to do things based on different error scenarios, it's probably time to turn to rewriting it in another language.
The simplest thing this does for me is guarantee that all the parameters needed are valid and present before continuing.
I've been doing it this way for years and it really, really works. Some places have reservations with it since its lack of formality is considered "risky" by some.
Though, an alternative to switching to another language is using xargs well. Writing bash with some immutability has been pretty invaluable for my workflows lately.
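A minimal sketch of the "validate every parameter before doing anything" idea from the comment above (the commenter does it in bash; this is Python for consistency with the other sketches here). The hostname pattern, the --reason flag, and the 10-server cap are all invented for illustration:

```python
#!/usr/bin/env python3
"""Sketch: fail fast on bad inputs before touching anything destructive."""
import argparse
import re
import sys

HOST_RE = re.compile(r"^[a-z0-9-]+\.internal$")   # hypothetical naming scheme
MAX_SERVERS = 10                                   # refuse suspiciously large sets


def parse_args(argv):
    parser = argparse.ArgumentParser(description="remove servers from a pool")
    parser.add_argument("servers", nargs="+", help="hostnames to remove")
    parser.add_argument("--reason", required=True, help="ticket or change id")
    args = parser.parse_args(argv)

    bad = [h for h in args.servers if not HOST_RE.match(h)]
    if bad:
        parser.error(f"malformed hostnames: {', '.join(bad)}")
    if len(args.servers) > MAX_SERVERS:
        parser.error(f"refusing to act on {len(args.servers)} servers (max {MAX_SERVERS})")
    return args


if __name__ == "__main__":
    args = parse_args(sys.argv[1:])
    # Only after every input has been checked do we touch anything.
    for host in args.servers:
        print(f"would remove {host} ({args.reason})")
```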
It's probably their name for an automated admin task. The post does not imply that this was merely a checklist of things to do. Ansible calls its automation recipes "playbooks" as well.
It's probably a page on the internal wiki that the S3 team follows for that particular task. Most of the actual steps are probably automated, but it sounds more like a checklist.
I used to follow runbooks/playbooks written on the internal wiki when I worked at Amazon.
I don't think it means "playbook" in the Ansible sense. The dictionary (i.e. Wikipedia) definition of "playbook" is "a document defining one or more business process workflows aimed at ensuring a consistent response to situations commonly encountered during the operation of the business", and that's how I know it.
At $work, certain types of frequently-occurring alerts have playbooks that document how the alert in question can be diagnosed and how known causes can be remedied. Something like "Look at Grafana dashboard X. If metric Y is doing this and that thing, the cause is Z. Log on to box 16 and systemctl restart the foo.service."
To be fair, the real problem isn't that someone screwed up a playbook or command. The real problem is that a tiny mistake in a command can cause an entire service to be disrupted for hours. That's the problem that needs to be fixed.
"While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future."
It seems like sometimes this is just how iteratively automating things works, especially on an internal-facing tool.
You have some process that starts out being "deploy this app with this java code". You deploy once and while, so it's not a big deal. But then those changes get a bit more frequent and so you pull out the common bits and the process becomes "make this YAML change in git and redeploy the app".
That works until you find yourself deploying 5 times a day, so you turn it into a MySQL table, and the process becomes "write a ROLL plan that executes this UPDATE x=y WHERE t=u; command"
After a while you get super annoyed at some quirk of the commands and figure, "Ok, fine, I'll just add an endpoint and some logic that just does this for the command case."
Then you wanna go on vacation and the new guy messed up the API request last week, so you figure, "I'll just add a little JS interface with a little red warning if the request is messed up in this way or that before I go".
You get back from vacation and some original interested party (whoever has wanted all these changes deployed) watched the intern make the change and thinks they could just do it themselves if they had access to the interface. You're wary, but you make the changes together a few times and maybe even add a little "wait-for-approval" node in the state machine.
Life is good. You've basically de-looped yourself, aside from a quick sanity check and button press, instead of what was a ~2 hour code + build + PR + PR approved + deploy process.
Then that interested party goes to work for Uber and the rest of your team adds a few functionalities on top of the interface you built and it all goes pretty well, until you realize that now that this thing that used to be 20 YAML objects is now 50k database records, and a bunch of them don't even apply anymore. So you build a button to disable some group of them, but after getting it deployed you realize it's actually possible to issue a "disable all" request accidentally if you click a button in your janky JS front-end before the 50k records download and get parsed and displayed. Oops! This mistake that you and the original interested party would have never made (because you spent the last 2 years thinking about all this crap) is probably a single impatient anxious mouse-click away from happening. So you make a patch and deploy that.
Congrats! You found that particular failure mode and added some protections for it, and maybe added some other protections like rate-limiting the deletions or updates or whatever. That's cool, but is that every failure mode? I bet it isn't. What happens when someone else thinks you have too many endpoints and just drops to SQL for the update?
Basically, yeah, of course you think of this stuff while iterating on it. But you figure "only power users are on the ACL" or "my teammates will understand the data model before making changes, or ask me first" or "that's what ROLL plans are for" or "I'll show a warning in the UI" or whatever. Fundamentally, you're thinking about a way to do a thing, if you're even thinking about it at all.
So yeah, that's what I've spent the last year or two doing. :-)
I have been doing this as well, though not quite at this scale; it's mostly Python scripts to automate something, but because of the low scale and the fact that I am the sole owner + user, I am good to go :-D
You don't have to know what would be a mistake. E.g. if the tool is used most of the time to operate on a small set of servers, you have some extra confirmation or command-line option for removing a large set.
That's good UI design in tools with powerful destructive capabilities. You make the UI for doing lots of things different enough from the UI for the few things you do routinely that there's no mistaking them.
Yes, but be careful. UIs like that tend to accumulate "--yes" options, because you don't feel like being asked every time for 1 server. Then one day you screw up the wildcard and it's 1000 servers, but you used the --yes template.
Which is why I'm pointing out that to design UIs like these you should fall back on slightly different UIs depending on the severity of the operation.
This is a good pattern to use. The more pre-feedback I get, the less likely I am to make a horrible mistake.
However one problem I often see with this pattern is the numbers are not formatted for humans to read. Suppose it prompts:
"1382345166 agents will be affected. Proceed? (y/n)"
Was that ~100M or ~1B agents? I can't tell unless I count the number of digits, which itself is slow and error-prone. It's worse if I'm in the middle of some high-pressure operation, because this verification detour will break my concentration and maybe I'll forget some important detail.
Now if the number is formatted for a human to consume, I don't have to break flow and am much less likely to make an "order-of-magnitude error":
"1,382,345,166 (1.4M) agents will be affected. Proceed? (y/n)"
I always attempt to build tooling & automation and use it during a project, rather than running lots of one-off commands. I find this usually saves me & my team a lot of time over the course of a project, and helps reduce the number of magical incantations I need to keep stored in my limited mental rolodex. I seem to have better outcomes than when I build automation as an afterthought.
I think it depends on the quality of the feedback. Most tooling sucks, so the messages are very literal trace statements peppered through the code, rather than statements of what the user-facing impact will be. When the thing is just spitting raw information at me, I'm probably going to train myself to ignore it. But if it can tell me what is going to happen, in terms that I care about, then I'll pay attention.
Imagine I just entered a command to remove too many servers that will cause an outage:
"Finished removing servers"
(better than no message, I suppose)
vs
"Finished removing 8 servers"
(better, it's still too late to prevent my mistake
but at least I can figure out the scale of my mistake)
vs
"8 servers will be removed. Press `y` to continue"
(better, no indication of impact but if I'm paying
attention I might catch the mistake)
vs
"40% capacity (8 servers) will be removed.
Load will increase by 66% on the remaining 12 servers.
This is above the safety threshold of a 20% increase.
You can override by entering `live dangerously`."
(preemptive safety check--imagine the text is also red so it stands out)
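A minimal Python sketch of that last, preemptive style of check; the 20% threshold, the fleet size, and the "live dangerously" override are modeled on the example above and are otherwise invented:

```python
SAFETY_THRESHOLD = 0.20  # hypothetical maximum acceptable load increase


def check_removal(total: int, to_remove: int) -> None:
    """Show user-facing impact, then refuse large removals without an override."""
    remaining = total - to_remove
    if remaining <= 0:
        raise SystemExit("refusing: this would remove the entire fleet")
    load_increase = total / remaining - 1          # e.g. 20/12 - 1 = ~0.67
    removed_pct = to_remove / total
    print(f"{removed_pct:.0%} capacity ({to_remove} servers) will be removed.")
    print(f"Load will increase by {load_increase:.0%} on the remaining {remaining} servers.")
    if load_increase > SAFETY_THRESHOLD:
        print(f"This is above the safety threshold of a {SAFETY_THRESHOLD:.0%} increase.")
        if input("Type `live dangerously` to override: ") != "live dangerously":
            raise SystemExit("aborted")


if __name__ == "__main__":
    check_removal(total=20, to_remove=8)
```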
Obviously some UIs make some errors less likely. You don't have the "launch the nukes" button right next to the "make coffee" button, because humans are clumsy and don't pay attention.
Fat-finger implies you made your mistake once. A UI can't stop you from setting out to do the wrong thing, but it can make it astronomically unlikely to do a different action than the one you intended.
Simple example: I have a git hook which complains at me if I push to master. If I decide "screw you, I want to push to master", it can't assess my decision, but it easily fixes "oops, I thought I was on my branch".
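For illustration, a rough Python version of that kind of guard as a git pre-push hook: git runs whatever executable sits at .git/hooks/pre-push and feeds it one line per ref being pushed on stdin; exiting non-zero aborts the push. Treating "master" as the protected branch is the assumption here:

```python
#!/usr/bin/env python3
"""pre-push hook: refuse accidental pushes to master.

Install as .git/hooks/pre-push (executable). Git passes the remote name and
URL as arguments and lines of the form
    <local ref> <local sha> <remote ref> <remote sha>
on stdin. Delete or bypass the hook when you really do mean it.
"""
import sys

PROTECTED = {"refs/heads/master"}

for line in sys.stdin:
    parts = line.split()
    if len(parts) == 4 and parts[2] in PROTECTED:
        sys.stderr.write(f"pre-push: refusing to push to {parts[2]} "
                         "(remove/skip the hook if you really mean it)\n")
        sys.exit(1)
```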
There's a balance to be struck. I'd say the number of hoops you have to jump through to do something should scale with the potential impact of the operation.
That said, the only way to completely prevent mistakes is to make the tool unable to do anything at all.
(Or to encode every possible meaning of the word "mistake" in your software. If you could do that, you would probably get a Nobel prize for it.)
In a program I wrote I make the user manually type "I AGREE" (case-sensitive) in a prompt before continuing, just to avoid situations where people just tap "y" a bunch of times.
Habituation is a powerful thing: a safety-critical program used in the 90s had a similar, hard-coded safety prompt (<10 uppercase ASCII characters). Within a few weeks, all elevated permission users had the combination committed to muscle memory and would bang it out without hesitation, just by reflex: "Warning: please confirm these potentially unsaf-" "IAGREE!"
It's indeed a real problem. Hell, I myself am habituated to logins and passwords for frequently used dialog boxes, and so just two days ago I tried to log in on my work's JIRA account using test credentials for an app we're developing...
For securing very dangerous commands, I'd recommend asking the user to retype a phrase composed of random words, or maybe a random 8-character hexadecimal number - something that's different every time, so can't be memorized.
I think that even if someone can't memorize the exact characters, they'll memorize the task of having to type over the characters. Better would be to never ask for confirmation except in the worst of worst cases.
That's what I meant in my original comment when I wrote that "number of hoops you have to jump through to do something should scale with the potential impact of an operation". Harmless operations - no confirmation. Something that could mess up your work - y-or-n-p confirmation. Something that could fuck up the whole infrastructure - you'd better get ready to retype a mix of "I DO UNDERSTAND WHAT I'M JUST ABOUT TO DO" and some random hashes.
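A small sketch of the "retype something different every time" idea in Python; the word list and phrase length are arbitrary:

```python
import secrets

WORDS = ["amber", "falcon", "quartz", "delta", "harbor", "lintel", "mosaic", "tundra"]


def confirm_dangerous(action: str) -> bool:
    """Make the operator retype a phrase that is different every time.

    Unlike a fixed "I AGREE", this can't be banged out from muscle memory.
    """
    phrase = " ".join(secrets.choice(WORDS) for _ in range(3))
    print(f"You are about to: {action}")
    print(f"Type exactly '{phrase}' to continue.")
    return input("> ").strip() == phrase


if __name__ == "__main__":
    print("confirmed" if confirm_dangerous("restart the index subsystem") else "aborted")
```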
I've almost deleted my heroku production server even though you need to type (or copy paste....ahem...) the full server name (e.g. thawing-temple-23345).
I think the reason was that, because in my mind I was 100% sure this was the right server, when the confirmation came up I didn't stop to check whether it was indeed the correct one. I mechanically started to type the name of the server, and just a second before I clicked OK, I had this genius idea to double-check.... Oh boy... My heart dropped to the floor when I realized what I was about to do.
You could say that indeed Heroku's system of avoiding errors worked correctly....
However, the confirmation dialog wasn't what made me stop... Instead it was my past self's experience screaming at me, reminding me of that ONE time I did fuck up a production server years ago (it cost the company a full day of customers' bids... Imagine the shame of calling all the winning bidders and asking them what price they ended up bidding to win....)
My point is, maybe no number of confirmation dialogs, however complex they are, will stop mistakes if the operator is fixated on doing X. If you are working in semi-autopilot mode because you obviously are very smart and careful (ahem..), you will just do whatever the dialog asks you to do without actually thinking about what you are doing.
What, then, will make you stop and verify? My only guess is that experience is the only way. I.e. only when you seriously fuck up do you learn that, no matter how many safety systems or complex confirmation dialogs there are, you still need to double and triple check each character you typed, unless you want to go through that bad experience again....
A well-designed confirmation doesn't give you the same prompt for deleting some random test server as it does for deleting a production server. That helps with the "autopilot mode" issue.
I agree that it should help reduce the number of mistakes.
But I still believe auto-pilot mode is a real thing (and a danger!) .
My point is that I'm not sure if it's even possible to design one that actually cuts errors to 0.
And if that's indeed the case, even if it's close to 0, it's still non-zero, thus at the scale Amazon operates at, it's very probable that it will happen at least one time.
Maybe sometime in the future AI systems will help here?
I totally agree that it's a real issue, a danger, and that it's impossible to cut errors to zero.
I've also built complex systems that have been run in production for years with relatively few typo-related problems. The way I do it is with the design patterns like the one I just mentioned, which is also what TeMPOraL was talking about (and I guess you missed it.)
If you have the same kind of confirmation whenever you delete a thing, whether it's an important thing or not, you're designing a system which encourages bad auto-pilot habits.
You'll also note that Amazon's description of the way that they plan on changing their system is intended to fire extra confirmation only when it looks like the operator is about to make a massive mistake. That follows the design pattern I'm suggesting.
You could go further and try to prevent cat-on-the-keyboard mistakes, which is maybe what you're describing (solve this math equation to prove you are a human who is sufficiently not inebriated). Or even further and prevent malicious, trench-coat wearing, pointy-nosed trouble-makers.
The point is, yes, it is possible. That's what good design does.
It's not possible to be perfect, but you can certainly do better than taking down S3 because of a single command gone wrong.
One thing I have been doing for my own command line tools is adding a preview of what a command will do and making that preview the default. It's simple, but if the S3 engineer had first seen a readout of the huge list of servers that were going to be taken offline, instead of the small expected list, we probably would not be talking about this. There's obviously a ton more you can do here (have the tool throw up "are you sure" messages for unusual inputs, etc).
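A minimal sketch of a dry-run-by-default CLI in Python; the inventory function, host names, and --execute flag are invented for the example:

```python
import argparse


def resolve_targets(pattern: str):
    """Hypothetical inventory lookup; a real tool would query the fleet."""
    fleet = [f"s3-index-{i:03d}" for i in range(40)]
    return [h for h in fleet if pattern in h]


parser = argparse.ArgumentParser(description="take servers offline")
parser.add_argument("pattern", help="substring matching the hosts to remove")
parser.add_argument("--execute", action="store_true",
                    help="actually do it; without this flag we only preview")
args = parser.parse_args()

targets = resolve_targets(args.pattern)
print(f"{len(targets)} server(s) match {args.pattern!r}:")
for host in targets:
    print(f"  {host}")

if not args.execute:
    print("Dry run only. Re-run with --execute to take these offline.")
else:
    print("Removing servers...")  # the real action would go here
```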
If the computer knows exactly what actions would be a mistake - how? The difference between correct and incorrect (not to mention legal and illegal) is usually inferred from a much wider context than what is accessible to a script. Mind you, in this specific case, Amazon even implies that such a command could have been correct under other circumstances.
So, this means a) strong superhuman AI (good luck), b) deciding from an ambiguous input to one of possibly mistaken actions (good luck mapping all possible correct states), or c) drool-proof interface ("It looks you're trying to shut down S3, would you like some help with that?").
TL;DR: yes, but it's a cure worse than the disease.
Possibly the values were all within range. It was just that this operation only worked on elements that were a subset. No amount of validation will catch that error.
You could feedback a clarification, but if that happens too often nobody will double check it after they have seen it over and over.
While you can't prevent user error without preventing user capability, you can (as others have observed) follow some common heuristics to avoid common failure modes.
A confirm step in something as sensitive as this operation is important. It won't stop all user error, but it gives a user about to accidentally turn off the lights on US-EAST-1 an opportunity to realize that's what their command will do.
If you have a UI that allows you to undeploy 10 servers, it will also allow you to undeploy 100 servers - unless you specifically thought about the possibility that there might be a lower bound on the number of servers, which they obviously hadn't before this. It's easy to talk about it after the fact, but nobody is able to predict all such scenarios in advance - there are just too many ways to mess up to have special code for all of them in advance.
The tool as a whole should incorporate a model of S3. Any action you take through the UI should first be applied to this model, and then the resulting impact analyzed. If the impact is "service goes down", then don't apply the action without raising red flags.
Where I work we use PCS for high availability, and it bugs the heck out of me that a fat-fingered command can bring down a service. PCS knows what the effect of any given command will be, but there's no way (that I know of) to do a "dry run" to see whether your services would remain up afterward.
In practice, it would likely be very hard to make a model of your infrastructure to test against, but I can imagine a tool that would run each query against a set of heuristics, and if any flags pop up, it would make you jump through some hoops to confirm. Such a tool should NEVER have an option to silently confirm, and the only way to adjust a heuristic if it becomes invalid should be formally getting someone from an appropriate department to change it and sign off on it.
By the way, this is how companies acquire red tape. It's like scar tissue.
They probably didn't know the service would go down. For that, you need to identify the minimal requirements for the service to be up, upfront, and code those requirements into the UI, upfront. Most tools don't do that. File managers don't check that the file you delete isn't necessary for any of the installed software packages to run. Shells don't check that the file you're overwriting isn't a vital config file. Firewall UIs don't check that the port you're closing isn't vital for some infrastructural service. It would be nice to have a benevolent, omniscient, God-like UI that would have the foresight to check such things - but usually the way it works is that you learn about these things after the first (if you're lucky) time it breaks.
Or it makes you re-enter the quantity of affected targets as a confirmation, similar to the way GitHub requires a second entry of a repo name for deletion.
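A tiny Python sketch of that GitHub-style "retype the count" confirmation; the function and prompt wording are hypothetical:

```python
def confirm_count(targets: list) -> bool:
    """Require the operator to retype the number of affected targets.

    You can click "yes" on autopilot, but typing "8" when you believed you
    were removing 2 servers forces a moment of recognition.
    """
    print(f"This will affect {len(targets)} target(s):")
    for t in targets:
        print(f"  {t}")
    typed = input("Re-enter the number of targets to confirm: ").strip()
    return typed == str(len(targets))


if __name__ == "__main__":
    ok = confirm_count(["s3-index-001", "s3-index-002"])
    print("confirmed" if ok else "aborted")
```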
Agreed -- a postmortem should cite that deployment goof as the immediate cause, with a contributory cause of "you can goof like this without getting a warning etc".
I don't understand how this is even possible in a company operating on that scale. Granted, I'm a lowly scientific programmer with no clue about running a cloud infrastructure, but I would have imagined that there would be at least a pretense of oversight for destructive commands run in such an environment. A scheme as simple as "any destructive command run on S3 subsystems is automatically run in a dry run form, and requires independent confirmation by 2-3 other engineers to actually come into effect" would have prevented this altogether. Given the overall prominence of S3, this incident seems to demonstrate a rather callous attitude on the part of the organization.
I thought the same thing before I went into the industry but now that I've been in it for a few years (including two at Amazon), it doesn't surprise me.
I suspect locking everyone down in the way you suggest would cost more in lost productivity (and costs for the infrastructure that would be required for greater auditing, etc.) than is lost in outages like this.
A number of lawyers must have drafted those lines, and five people, including Bezos, must have approved them.
Those lines are not reflective of what Amazon is, but of what picture Amazon wants to paint now. They have clarified that it was their error and not some hacking attempt. Secondly, they have not vilified the engineer in question, because Amazon's culture is already a bit of a ??? in the public mind.
But they have got it right. Shit happens and this is not the first time it has happened or the last time it will happen. Also it will happen with Microsoft, Google and everyone else.
Maybe we will build even better technologies that rely on two different cloud providers instead of one.
It is always going to be like that. If you write software that has a rule "do not remove more than 5% of capacity at once," it will always work; yet if you tell a systems engineer "please do not remove more than 5% of the capacity at once," it fails some 0.0x% of the time. The solution is to move the execution of the change into a system that spits out steps that are automatically executed by the system itself, entirely removing the human factor.
Every communication channel has its flaws. CLI is fast and that's why it is a favorite. It is also noisy. If you have to worry about a fat finger, you are using the wrong communication channel or could afford to be a bit more verbose within that channel. That's why rm has safety nets.
GUIs are really great. They're a recent development in the computing industry that help mitigate this sort of problem. You can even put prompts in that get you to confirm Yes/No to continue.
I think Borland do some RAD systems, and Microsoft have an IDE of sorts on the way too.
20 years ago I read a postmortem of Tandem and their Non-Stop Unix. A core take-away for me was: "Computer hardware has gotten way more reliable than it was." combined with "The leading cause of outages has become operators making mistakes."
Serious question - why did no one ever accidentally launch and nuke a city, with thousands of nuclear warheads able to do so on short notice? AWS presumably puts a lot more redundancy in, and yet with all that effort comes up this far short. Why? It has a huge amount of brainpower set up so that this never, ever happens. Whatever works for the military - can't AWS adopt those actual best practices?
When I think about questions like these, I recall the Anthropic Principle. Perhaps on lots of planets, intelligent life ceased at the beginning of the Atomic Age. Here we are seven decades (several generations!) in, and we're still alive! The numerator on the odds almost doesn't matter, when you never get to see the denominator. Now that we're finding all these planets, perhaps we ought to start looking for nuclear extinction events? They probably wouldn't leave lasting evidence, but if they're common enough they wouldn't need to...
Actually the accounts I've read seem to indicate that most missile operators simply decided they would never launch no matter what. God bless them, for that.
So they're going to build a complex system to correct possible user command line errors. That new system itself will introduce possible errors. Wouldn't an administrative GUI have been much simpler to implement overall?
One of the positive things about Amazon's culture is that they heavily emphasize blaming broken processes, not blaming people. I doubt the person involved will have any negative consequences beyond embarrassment.
I would be horrified if I learned that Amazon or any other company of such size in any way castigates employees for such very human errors. The guilt (don't beat yourself up) he or she likely feels is bad enough.
Anyway, to me this firstly sounds like a "tool" or command that was too powerful with not enough safeguards. Who knows, the command might even be ambiguous.
" At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended."
> While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.
These sorts of things make me understand why the Netflix "Chaos Gorilla" style of operating is so important. As they say in this post:
> We build our systems with the assumption that things will occasionally fail
Failure at every level has to be simulated pretty often to understand how to handle it, and it is a really difficult problem to solve well.
> Failure at every level has to be simulated pretty often to understand how to handle it, and it is a really difficult problem to solve well.
Exactly. It seems likely that Amazon tests the restart operation, but it would be hard to test it at full us-east-1 scale. Running a full S3 test cluster at that scale would likely be a prohibitive expense. Perhaps the "index subsystem" and "placement subsystem" are small enough for full-scale tests to be tractable, but certainly not cheap, and how often do you run it? Also, hindsight is 20/20, but before this incident it might have been hard to identify "full-scale restart of the index subsystem" as rising to the top of the list of things to test.
One approach is to try to extrapolate from smaller-scale tests. It would be interesting to know what kinds of disaster testing Amazon does do, and at what scale, and whether a careful reading could have predicted this outcome.
> Failure at every level has to be simulated pretty often to understand how to handle it
Keep in mind, S3 "fails" all the time. We regularly make millions of S3 requests at my work. Usually we get 1:240K failure rate (mostly GETs), returning 500 errors. However, if you're really hammering an S3 node in the hash ring (e.g. Spark job), we see failures in the 1/10K range, including SocketExceptions, where the routed IP is dead.
You need to always expect such services to die in your code, setting the proper timeouts, backoffs, retries, queues, and dead letter queues.
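A generic Python sketch of the retry-with-backoff part of that advice; the wrapper is an illustration, not any particular SDK's API (the AWS SDKs ship their own retry configuration):

```python
import random
import time


def with_retries(call, max_attempts=5, base_delay=0.2, max_delay=10.0,
                 retryable=(Exception,)):
    """Call `call()` with bounded retries, exponential backoff, and jitter.

    Jitter spreads retries out so thousands of clients don't hammer the
    service in lockstep after a transient failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retryable:
            if attempt == max_attempts:
                raise                      # let the caller dead-letter it
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))


# Usage sketch (the S3 call is hypothetical):
# obj = with_retries(lambda: s3.get_object(Bucket="my-bucket", Key="my-key"))
```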
> Perhaps the "index subsystem" and "placement subsystem" are small enough for full-scale tests to be tractable, but certainly not cheap, and how often do you run it?
Rough guide:
CT = cost of 1 full scale test with necessary infrastructure and labor costs added up
CF = amount of money paid out in SLA claims + subjective estimate of business lost due to reputation damage etc
PF = estimate of probability of this event happening in a given year
if PF * CF > CT, then you run such a test at least once a year. Think of such an expense as an insurance premium.
What Netflix does with their simian army is amortize the cost of doing the test across millions of tests per year and the extra design complications arising from having to deal with failures that often.
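A worked example of the rough guide above, with made-up numbers plugged in:

```python
# Made-up numbers plugged into the rough guide above.
CT = 250_000      # cost of one full-scale test (infrastructure + labor)
CF = 4_000_000    # SLA payouts + estimated business lost if the event happens
PF = 0.10         # estimated probability of the event in a given year

expected_annual_loss = PF * CF          # 400,000
print(f"expected annual loss: ${expected_annual_loss:,.0f} vs test cost ${CT:,.0f}")
if expected_annual_loss > CT:
    print("run the test at least once a year; treat CT as an insurance premium")
```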
This is precisely why cells (alluded to in the write-up) are beneficial. If the size of a cell is bounded and you scale by adding more cells, testing the breaking point of the largest cell becomes an easier problem. There is still usually a layer that spans across all cell boundaries, which is what then becomes hard to test at prod scale (so you make that as simple as possible)
Running a full zone test is only possible when they have a new zone available, unused. I bet they do these tests, and they now have a new scenario to test.
They also probably have one or more test regions where they could perform a test like this. But it's presumably not at nearly the same scale as us-east-1, the region affected by this incident. And to a considerable extent the problem was one of scale. The writeup makes the recovery sound fairly straightforward; but due to the sheer size of S3 in this region, it took hours for the system to come back up, which was apparently unexpected.
(Nit: this incident affected a region, not a zone. us-east-1 is a region, which is divided into zones us-east-1a, us-east-1b, etc. S3 operates on regions.)
I recently saw a talk where they referred to Chaos Monkey (kills instances), Chaos Gorilla (kills many instances for a single service in a single region) and Chaos Kong (takes an entire region offline)
> Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention.
Yep. There's a transition period where you can't rely on redundancy any longer because there are so many components that it's basically inevitable that at any given time somewhere something will be in a degraded state. So you design for that case, the degraded normalcy case. You make something failing somewhere a non-emergency. It takes a lot of work to do but when you have things working in that way then you can guarantee that you're in that state by testing it routinely in production.
Totally agree. Would also point out that if you have systems up for many years, they likely haven't been updated in all that time... shouldn't people find that alarming?
> From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3.
Ensuring that your status dashboard doesn't depend on the thing it's monitoring is probably the first thing you should think about when designing your status system. This doesn't fill me with confidence about how the rest of the system is designed, frankly...
I agree with that in general, but having your monitoring system be dependent on the thing it monitors is a pretty big goof. It's possible that the dependency was very non-obvious and many layers deep, which is more understandable, but still... it's pretty fundamental.
They have a twitter account for such incidents and used it appropriately. They did not slack in relaying the outage to customers, and between that and the fact that no S3 services were operating I think the message was pretty clear: "We fucked up, give us a couple hours"
And apparently they had never tried rebooting some of the most important parts of that system. Just when you start to think that someone's really gotten it right you come to learn they're just fumbling around in the dark like everyone else.
My interpretation of this is that the indexing system was resilient to the loss of a certain amount of capacity (probably around ⅓ + 1 host). As a guess, the indexing system probably used some form of consensus (e.g. Paxos) which has had an active leader for years. Deployments stay within that capacity constraint, so while hosts have been restarted and replaced (data center migrations, hardware lease expiration, failures, upgrades, etc.), they may not have recently run into a situation where quorum wasn't available for a partition, especially at the scale of restarting the entire fleet.
Since restarting the entire fleet would incur downtime of all relevant S3 operations, it's unlikely that it was something ever intentionally done in production (and they may or may not have run that scenario in other environments).
Source: I used to run several large scale services at Amazon.
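If that guess about quorum is right, the arithmetic of majority quorum is what makes a fleet-wide removal so unforgiving. A toy sketch, assuming a Paxos/Raft-style partition; the fleet sizes are invented:

```python
# Toy illustration of majority quorum for a consensus-backed index partition.
def has_quorum(total_members: int, alive_members: int) -> bool:
    # A partition can keep serving only while a strict majority is up.
    return alive_members >= total_members // 2 + 1

print(has_quorum(9, 6))  # True: losing 3 of 9 is survivable
print(has_quorum(9, 4))  # False: losing 5 of 9 forces the partition to stop and restart
```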
To put what @jonhohle said another way, Amazon had probably never brought up the entirety of S3 from zero to production-ready in a production environment before. I wouldn't necessarily classify this as "fumbling around in the dark." Perhaps they should have tested this in a simulated environment, but (to be fair) on a distributed fault-tolerant system, it probably wasn't a top-priority situation to test.
I'm curious as to why their fix was to host the Service Health Dashboard on more AWS regions. It seems like the responsible thing to do is to host it entirely on a competitor's service. That way, it's very simple to know that the status page will work no matter what happens to you.
If they did host their status page on a competitor's service, then they'd be reliant on that service, which might backfire if the competitor's service goes down while Amazon's own systems stay up.
What they really need is failover capability, which can fire up the status page on a competitor's service (or maybe on a completely separate disaster recovery site owned by Amazon) in case Amazon's own services go down.
I'm sure Amazon's architects and engineers are more than capable of designing and implementing such a robust system and recognizing its importance. So it puzzles me as to why it wasn't done.
You can use several hosts and have two subdomains, so if one is not responding, engineers and managers know there are two status pages. Heck, have two different domains for them as well in case there are DNS issues: amzstatus1.com and 2. Not dependent on the Amazon domain anymore either.
Or to host it on a static .html page that gets rewritten every 60 seconds or so by an external process, running on physical servers. Minimal stack, so minimal attack surface.
Or have it pull from two sources, one local (S3) and one remote (GCE or whatever), and make a hard positive from either source signal "down." Otherwise the page would be down if just the remote source were down.
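A sketch of that "hard positive from either source" logic; the probe endpoints are hypothetical and the point is only the OR, not the plumbing:

```python
# Hypothetical status aggregator: report "down" if EITHER probe positively says so,
# but don't report "down" merely because a probe itself is unreachable.
import urllib.request

def probe(url: str):
    """Return True (down), False (up), or None (probe unreachable / unknown)."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read().decode().strip() == "down"
    except OSError:
        return None

def service_is_down(local_probe_url: str, remote_probe_url: str) -> bool:
    results = [probe(local_probe_url), probe(remote_probe_url)]
    return any(r is True for r in results)  # a hard positive from either source wins
```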
Considering the size of AWS, the number of services any one service relies on is huge; even if only a single dependency many layers deep uses S3, when S3 goes down the service as a whole will be affected. Honestly, S3 is like a black hole: everybody stores everything in S3 these days. It's a horizontal component, but used in a vertical manner. Weird but true.
I'd argue that the message of this post-mortem, that is "mistakes were made, but the fault is with the tools and not any one person" is a much better response than the CEO making a symbolic statement claiming the fault.
Both better for morale and better for preventing another incident.
To me that is just another example of 'caring theater'. Whereby carefully crafted PR responses [1] appear to take responsibility in a 'buck stops here' kind of way. The truth is it is unreasonable in many cases for the top person to be able to prevent any and all errors. If you try and make everything perfect with no mistakes you would never make any money (and of course it's not even possible).
[1] ie 'our customers safety and security is of the utmost importance to us'.
What are the real consequences for a CEO saying that? It's not like he's going to get fired or have his stock options revoked. If anything, people are going to praise him for taking ownership like that, as you did. Virtually no matter what he does, I'd bet someone in his position is going to be very comfortable for the rest of his life.
I'd be far more impressed if a low-level employee whose whole family depended on his job and who stood a good chance of getting fired admitted a serious mistake.
We all watched the news and I recall him saying that. The specific quote I don't remember, but it was something like "you can consider that I did." I think he was asked what would happen to the person who caused it and who that person was.
Everyone knew right away this had to be human error. Right away. Switches simply had too much redundancy.
It was big then and not sure if I can locate a video.
Not as interesting an explanation as I was hoping for. Someone accidentally typed "delete 100 nodes" instead of "delete 10 nodes" or something.
It sounds like the weakness in the process is that the tool they were using permitted destructive operations like that. The passage that stuck out to me: "in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level."
At the organizational level, I guess it wasn't rated as all that likely that someone would try to remove capacity that would take a subsystem below its minimum. Building in a safeguard now makes sense as this new data point probably indicates that the likelihood of accidental deletion is higher than they had estimated.
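A minimal sketch of the kind of safeguard being described, assuming a hypothetical removal tool; the minimum-capacity and batch numbers are invented:

```python
# Hypothetical guard in a capacity-removal tool: refuse any request that would
# drop a subsystem below its minimum required capacity, regardless of intent.
class CapacityError(Exception):
    pass

def plan_removal(current_hosts: int, hosts_to_remove: int, minimum_required: int,
                 max_batch: int = 5) -> int:
    if current_hosts - hosts_to_remove < minimum_required:
        raise CapacityError(
            f"Removing {hosts_to_remove} of {current_hosts} hosts would breach "
            f"the minimum of {minimum_required}; refusing."
        )
    # Remove capacity slowly: never more than max_batch hosts per step.
    return min(hosts_to_remove, max_batch)
```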
I'm drawing conclusions based on my time at AWS, but I believe this is due to the service discovery mechanism most of AWS uses. It's a gossip protocol, with a daemon running on each service host. There are very valid reasons you would be manipulating the set of hosts currently being gossiped- to remove a host for maintenance for example. In this use case I think they wanted to take out an entire service, so ensuring a majority is still alive isn't necessarily a solution.
There are two distinct failures here, the tool being too liberal and then the failure mode not being well understood for this index subsystem.
If I remember correctly, all of our processes were organized through a weekly change management process that went through reviews of the exact commands to be run. Being oncall was a bit more liberal, you would typically execute commands on production as needed based on your experience and with others over your shoulder if you had any doubt. Interacting with the gossip protocol was a pretty common thing to do when you were triaging issues.
Unrelated, I was briefly a SME on the EBS billing system and probably interacted with the poor guy who executed this command.
I always wonder about unintended consequences of this sort of thing. Like someday there will be a worm about to rampage through their servers and someone says, "take them all offline now!" and the answer is, "we can't because of the throttle safeguard we put in place after incident XYZ, it will be about 17 hours..."
By safeguard I meant (and I think Amazon means too) an extra step required of the user before they can perform the action, so they don't do it by accident. Not something that prevents it entirely. Like how an MMO, before you delete a character, requires you to type the character's name into a box that pops up. That's far outside the realm of the usual user interface, but it means that if you are just trying to edit a character it's impossible to accidentally hit that delete key. An analogous system for Amazon that would have prevented this outage: delete 10 nodes, OK. Delete 100 nodes, a box pops up saying 'To delete this many nodes you must type the following into a message box: "I want to take down a dangerously large number of nodes."'
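In code, that MMO-style gate might look something like this sketch; the threshold and wording are made up:

```python
# Hypothetical confirmation gate: small removals proceed, large ones demand a
# deliberately awkward typed phrase so they can't happen by reflex.
def confirm_removal(node_count: int, threshold: int = 50) -> bool:
    if node_count <= threshold:
        return True
    phrase = "I want to take down a dangerously large number of nodes"
    typed = input(f'Removing {node_count} nodes. Type exactly:\n  "{phrase}"\n> ')
    return typed.strip() == phrase
```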
I think the biggest problem with flags like --emergency is if they end up in daily use, such as git push --force. Then they are both sudo-level AND used without a lot of thought.
I'm remembering tools I've worked with where the cheeky dev required things like type the sentence "I know what I am doing and wish to proceed." in order to perform unsafe operations.
I've always wondered why ops hasn't adopted some of the best practices that have been around for years to avoid fat-finger errors. Like why don't we have systems where doing something dangerous requires two separate people to run the command, or there's an approval step, or whatever.
The most common explanation: "cute" interactions make it harder to script the command-line tools because you have to account for the extra layer of indirection or write a bit of screen-scrape logic to get the command prompt input.
I've always found that explanation a little threadbare.
I think the hoops you have to jump through should scale with severity. Then if you find yourself writing scripts that screen-scrape interactive output to get around some safeguards, you're doing something very wrong.
Which is why there needs to be an override switch, but it needs to be very very explicit that you are going past the safeguards. And only a limited number of people who can use that override.
There are people who can still do fleet wide root access commands and I don't think that type of thing will ever be removed (just very restricted) for this exact type of situation.
Instead of rate limiting you'd be better off making it a two-key operation once you hit X threshold. Jr admin can delete 10, but the boss needs to confirm a deletion of 100.
You can always build the safeguard to require approval from a peer (or superior) to an action that is normally considered dangerous, at least for overriding the throttle.
Anything that requires human approval for routine operations quickly devolves into bureaucracy that adds a lot of manual steps without any real safety. Now, you may argue definition of "routine operation", but the thing in the article didn't sound like they were doing anything crazy.
But just imagine the "ooooh shit" moment of this person.
Something similar happened to us, when an Engineer deleted part of our production database with a single command. Fortunately, we could reconstruct it from backups and replication logs.
To me, the more important part would've been the solutions they'll come up with to avoid something like this happening again. Is it going to just be "add a line to the playbook asking the engineer to double check the command" or will they make big changes across the system to prevent things like this happening.
I'm interested in something like that, too! I've got a few destructive Ansible commands that I check a few dozen times before running. But there's always a chance that I'm tired/distracted/whatever and run something silly anyway. Typically I put in things like config test checks and prompts, but damn it's scary how much power I have with this Ansible setup. I definitely don't want to be in this AWS position.
Take a moment to look at the construction of this report.
There is no easily readable timeline. It is not discoverable from anywhere outside of social media or directly searching for it. As far as I know, customers were not emailed about this - I certainly wasn't.
You're an important business, AWS. Burying outage retrospectives and live service health data is what I expect from a much smaller shop, not the leader in cloud computing. We should all demand better.
Also notably missing is the "we will automatically refund all affected customers" line that we'd expect from somebody who wants to provide excellent service.
A graphical illustration of the service dependencies they were talking about would have been nice as well.
If you request it and provide evidence that they find compelling.
> To receive a Service Credit, you must submit a claim by opening a case in the AWS Support Center. To be eligible, the credit request must be received by us by the end of the second billing cycle after which the incident occurred and must include:
> the words “SLA Credit Request” in the subject line;
> the dates and times of each incident of non-zero Error Rates that you are claiming; and
> your request logs that document the errors and corroborate your claimed outage (any confidential or sensitive information in these logs should be removed or replaced with asterisks).
> If the Monthly Uptime Percentage applicable to the month of such request is confirmed by us and is less than the applicable Service Commitment, then we will issue the Service Credit to you within one billing cycle following the month in which your request is confirmed by us. Your failure to provide the request and other information as required above will disqualify you from receiving a Service Credit.
Interesting observation. Maybe the answer is that a behemoth like AWS does this because they _can_ get away with it. In contrast to AWS's cascading failures, the GitLab outage was a mere blip. Because they are several orders of magnitude smaller than Amazon, however, they had to be painfully transparent during their actual restore operations and in the post-mortem.
AWS has more implicit trust that this won't happen again, since they've never (I think?) had something like this happen, so just a few lines about fixing the tool that let all the nodes shut down is enough to restore confidence.
Emails seem to be going out. I got one a while ago. I suspect this was an initial response geared towards the general audience and a more specific technical response will be forthcoming.
> At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
I find that making errors on production when you think you're on staging is a big source of similar incidents. One of the best things I ever did on one job was to change the deployment script so that when you deployed you would get a prompt saying "Are you sure you want to deploy to production? Type 'production' to confirm". This helped stop several "oh my god, no!" situations when you repeated previous commands without thinking. For cases where you need to use SSH as well (best avoided but not always practical), it helps to use different colours, login banners and prompts for the terminals.
We have a deploy script that does exactly this, unfortunately we've all gotten our muscle memory so trained that most of us type the deploy command, press enter, type yes and press enter before we're ever even prompted. Fortunately in most cases a quick ctrl-c can prevent any actual damage.
I think the only way to nuke the muscle memory from this equation would be to have it make one type a random dictionary word (or solve an arithmetic problem or something - that one might help prevent drunk deploys ;).
If the prompt was type "production" to confirm, I'm sure I'd just as readily train myself to jump the gun on that one.
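One way to defeat that muscle memory, as suggested above, is to make the confirmation unpredictable. A sketch, assuming a local word list is available; the fallback words are arbitrary:

```python
# Sketch of a deploy confirmation that can't be typed from muscle memory:
# the required word changes every time. Assumes /usr/share/dict/words exists;
# falls back to a tiny built-in list otherwise.
import random

def confirm_production_deploy() -> bool:
    try:
        with open("/usr/share/dict/words") as f:
            words = [w.strip() for w in f if w.strip().isalpha()]
    except OSError:
        words = ["orange", "tripod", "velvet", "quarry", "lantern"]
    word = random.choice(words)
    typed = input(f"Deploying to PRODUCTION. Type '{word}' to confirm: ")
    return typed.strip() == word
```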
" we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected"
This is analogous to "we needed to fsck, and nobody realized how long that would take".
Rushed to the office at 6am because an important process that needed that host was about to run.
Plugged in keyboard+monitor, dead screen, nothing.
Physically power-cycled server.
Stood in front of monitor+keyboard. It occurred to me it was taking longer than expected to show POST screen. About that time, I got a page saying $ACTUALHOSTNAME is down.
Walk around to the back of the racks. The monitor cable had come detached from the cable extender that I plugged into the server. I had never plugged the monitor in at all, just the extension.
The server wasn't down in the first place, it just lost a virtual interface, which I was paged for, and stupidly tested that virtual interface instead of the REAL name/IP.
And then I raced to the office just so that I could cause an outage.
I once changed a piece of code that was referenced by every page on the customer-facing site (tens of millions of visits a day) to use a new function that someone had previously written (and was called on one page in the site). I mistakenly didn't look too closely at the implementation of the function, and didn't realize how badly its caching strategy was designed. When this code was deployed it instantly caused a thundering herd on our cache servers, bringing the site down for about ~40 seconds.
My worst was that one time I accidentally took down some services that were essential for order processing via an unrelated, large stress test. My test accidentally consumed a large amount of bandwidth saturating the links to a shared service.
I did all this while sitting 2 feet from a print out of "The 8 fallacies of distributed systems". Bandwidth is indeed not infinite, can confirm.
Due to youth, totally misplaced confidence, and a poor access-rights regime, I ran an untested script in production, causing it to fork uncontrollably and requiring a reboot.
TLDR; Someone on the team ran a command by mistake that took everything down. Good, detailed description. It happens. Out of all of Amazon's offerings, I still love S3 the most.
"It happens" is the only reasonable takeaway you can get from a postmortem like this. My worry is that people read it and go "I am aghast that such a command can be run!" without knowing that little commands like that are run numerous times a day without incident.
The only thing I read in there and go "hmmm" is that it took quite that long for the S3 service to recover, and that the status page wasn't hosted on someone that doesn't have an S3 dependency. That's just a plain "doh" moment :)
People need to realize when they go to the cloud it's not that 'it happens', it's that it will happen, and you have no ability to do anything about it. Fact of life and risk management.
... and it's a different risk from self-hosting, but self-hosting provides all sorts of similar issues (such as when you do this to yourself, the cost is now coming out of your pocket, not Amazon's, to employ software engineers to harden your scripts against making the same mistake twice).
Not to mention, Amazon is catching the long tail of cloud failure like Google is catching the long tail of search keywords. They can now say with a somewhat straight face - "You know all those scripts you run to keep everything up? We have figured out many, many more possible ways for them to fail than you probably ever will, and we have added more layers of safeguards than you can even imagine."
AWS partitions its services into isolated regions. This is great for reducing blast radius. Unfortunately, us-east-1 has many times more load than any other region. This means that scaling problems hit us-east-1 before any other region, and affect the largest slice of customers.
The lesson is that partitioning your service into isolated regions is not enough. You need to partition your load evenly, too. I can think of several ways to accomplish this:
1. Adjust pricing to incentivize customers to move load away from overloaded regions. Amazon has historically done the opposite of this by offering cheaper prices in us-east-1.
2. Calculate a good default region for each customer and show that in all documentation, in the AWS console, and in code examples.
3. Provide tools to help customers choose the right region for their service. Example: http://www.cloudping.info/ (shameless plug).
4. Split the large regions into isolated partitions and allocate customers evenly across them. For example, split us-east-1 into 10 different isolated partitions. Each customer is assigned to a particular partition when they create their account. When they use services, they will use the instances of the services from their assigned partition.
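For option 4, the assignment itself could be as simple as a stable hash of the account ID; a sketch, where the partition count and naming are invented:

```python
# Hypothetical cell/partition assignment: each account lands in one of N isolated
# partitions of the region, chosen deterministically at account creation.
import hashlib

NUM_PARTITIONS = 10

def assign_partition(account_id: str, region: str = "us-east-1") -> str:
    digest = hashlib.sha256(account_id.encode()).hexdigest()
    partition = int(digest, 16) % NUM_PARTITIONS
    return f"{region}-cell-{partition}"

print(assign_partition("123456789012"))  # e.g. "us-east-1-cell-7"
```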
So this is the second high profile outage in the last month caused by a simple command line mistake.
> Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
If I had to guess which company could prevent mistakes like this from propagating, it would be AWS. It points to just how easy it is to make these errors. I am sure that the SRE who made this mistake is amazing and competent and just had one bad moment.
While I hope that AWS would be as understanding as Gitlab, I doubt the outcome is the same.
Nothing is going to happen to the engineer that did this, other than embarrassment and probably a couple jokes once sufficient time has passed. Amazon has a strong culture around blaming the process, not the person. The failure here wasn't that the engineer ran the command, it was that the engineer was able to run the command.
Amazon has the wherewithal to not freaking publicly name their actually human employee, so I'd imagine their culture around outages is probably a lot more healthy.
Well to be fair, he named himself within his notes and did not object to the public nature of the disclosure. I agree with your sentiment though that names should not be included within postmortems in the general case.
tl;dr: Engineer fat-fingered a command and shut everything down. Booting it back up took a long time. Then the backlog was huge, so getting back to normal took even longer. We made the command safer, and are gonna make stuff boot faster. Finally, we couldn’t report any of this on the service status dashboard, because we’re idiots, and the dashboard runs on AWS.
Of course. Meant it as more of an "I just spent 10 minutes searching for my car keys while holding them, because I'm an idiot." No disrespect to the engineers.
Overall, it's pretty amazing that the recovery was as fast as it was. Given the throughput of S3 API calls you can imagine the kind of capacity that's needed to do a full stop followed by a full start. Cold-starting a service when it has heavy traffic immediately pouring into it can be a nightmare.
It'd be very interesting to know what kind of tech they use at AWS to throttle or do circuit breaking to allow back-end services like the indexer to come up in a manageable way.
Something that wasn't addressed -- there seems to be an architectural issue with ELB where ELBs with S3 access logs enabled had instances fail ELB health checks, presumably while the S3 API was returning 5XX. My load balancers in us-east-1 without access logs enabled were fine throughout this event. Has there been any word on this?
I think it comes down to how important your ELB logs are -- if they are important enough that you don't want to allow traffic without logs (i.e. if you're using them for some sort of auditing/compliance), then failing when it can't write the logs seems like the right choice.
Thanks, that is a fair perspective. In our case we're using ELB logs as a redundant trace and it isn't critical that our traffic stops if the access logs fail. It would be nice if this behavior became a toggle in ELB settings, but I think we can set something up to disable access logs programmatically if we start seeing S3 issues.
Good luck with this. We tried to make changes yesterday to mitigate the impact, but the AWS console was also affected. We were hesitant to make API calls for the changes since we weren't sure they would complete successfully, given all the services we found actually depend on S3 internally.
Really pleased to see this, it's good to see an organisation that's being transparent (and maybe given us a little peek under the hood of how S3 is architected) and most importantly they seem quite humbled.
It would be easy for an arrogant organisation to fire or negatively impact the person that made the mistake. I hope Amazon don't fall into that trap and instead focus on learning from what happened, closing the book, and moving on.
There are quite a few comments here ignoring the clarity that hindsight is giving them. Apparently the devops engineers commenting here have never fucked up.
On the contrary: I feel like Amazon is taking some flak because everyone here has messed up before, and are surprised that engineers (seemingly lacking failure experience) were able to do what they did.
I wouldn't task a junior sysadmin with a server deletion, would you? Nor could I ever consider someone without a fuckup a senior ;)
This is a bit off topic. The use of the word "playbook" suggests to me that they use Ansible to help manage S3. I wonder if that is the case, or if it's just internal lingo that means "a script". Unless there is some other configuration management system that uses the word playbook that I'm not aware of.
I'm genuinely curious. As my experiments with it have left me disappointed with its performance, I'm just not sure what I could use it for. Store massive amounts of data that is infrequently accessed? Well, unfortunately the upload speed I got to the standard rating one was so abysmal it would take too much time to move the data there; and then I suspect the inverse would be pretty bad as well.
S3 uploads can be very dependent on what the network looks like between your systems and the endpoint. We introduced S3 Transfer Acceleration to help address this. This uses our edge network as your endpoint, then sending your upload across our backbone rather than traversing the commercial internet. It comes with a small fee, but the fee only gets applied if it improves on what the time to transfer would be otherwise.
If you have lots of data that needs to be uploaded (TB/PB worth), then I'd take a look at AWS Snowball. https://aws.amazon.com/snowball/
Also, if you're using the AWS CLI to upload, make sure multi-part upload is enabled.
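Via boto3 (rather than the CLI), the equivalent knob is a TransferConfig; a sketch with illustrative thresholds, where the bucket, key, and file names are placeholders:

```python
# Sketch: large-object upload with multipart enabled and tuned via TransferConfig.
import boto3
from boto3.s3.transfer import TransferConfig

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,   # switch to multipart above 64 MB
    multipart_chunksize=16 * 1024 * 1024,   # 16 MB parts
    max_concurrency=8,                      # parallel part uploads
    use_threads=True,
)

s3 = boto3.client("s3")
s3.upload_file("backup.tar.gz", "my-bucket", "backups/backup.tar.gz", Config=config)
```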
Why use S3? It scales without any user interaction, is highly available (yes, I cringe saying that, but this is the first, and hopefully only time this has occurred! =D ) and extremely easy to access; it's as simple as an HTTP GET. Being able to address objects directly and not have to worry about managing file systems simplifies a lot.
One scenario: if you run a website that has a lot of static content (multiple GB of images, CSS, JS, etc.) and you don't want your HTTP server to be responsible for serving that content, then you give it all to S3 and let them serve it for you.
Performance out of S3 is generally really good. However, if you're looking to say, serve up a global website and your content is in a single S3 region, then you can leverage CloudFront CDN to serve up those objects. CloudFront integrates seamlessly with S3, and you don't pay transfer charges between CloudFront and S3.
Database backups: with the upload speeds I've seen, completing a backup of a database with hundreds of GB would take a really long time. I don't want that extra load and keeping that connection open forever.
> (...) [W]e have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.
All those tweets saying "turn it off and back on again"?
"We accidentally turned it off, but it hasn't been turned it off for so long it took us hours to figure out how to turn it back on."
Poorly-presented jokes aside, this is rather concerning. The indexer and placement systems are SPOFs!! I mean, I'd presume these subsystems had ultra-low-latency hot failover, but this says they never restarted, and I wonder if AWS didn't simply invest a ton of magic pixie dust in making Absolutely Totally Sure™ the subsystems physically, literally never crashed in years. Impressive engineering but also very scary.
At least they've restarted it now.
And I'm guessing the current hires now know a lot about the indexer and placer, which won't do any harm to the sharding effort (I presume this'll be being sharded quicksmart).
I wonder if all the approval guys just photocopied their signatures onto a run of blank forms, heheh.
I don't think you understand the architecture of the system if you are describing the indexer as a SPOF.
The system is a collection of shards. If you replicate it to create a second copy, then you'll just have as large a system, which is still a single point of failure.
The index, by necessity, has to be able to answer the question 'this object exists' or 'this object doesn't exist' - so it needs to have consensus.
My speculative presumption was going off the sole datapoint of "we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years". I'm not quite sure how to interpret "restart" in this context, mostly due to lack of exposure or experience.
The report also says "Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems." So you're right, it looks like multiple servers were supporting these systems, which does make sense (especially considering the load they would have seen). Okay.
I guess I didn't quite think through the load requirements and thought these were single machines - which is certainly ludicrous thinking :) - and that's where I got the SPOF reasoning from.
You're very right though, these consensus systems must be built as bottlenecks in order to see everything.
And there aren't really any alternatives: "build extra indexers and placement systems!" just gives you "but what if _all_ of them get taken offline?" and "it can't leave the datacenter, it sees 100GB/s of throughput" (number taken out of thin air).
CEOs all over the world just realized that they can't depend only on S3, and they might have to double up on their infrastructure and have a parallel environment on Azure or Google as well.
I keep being reminded of something I read recently that made me feel uneasy about google's cloud spanner [1]:
> the most important one is that Spanner runs on Google’s private network. Unlike most wide-area networks, and especially the public internet, Google controls the entire network and thus can ensure redundancy of hardware and paths, and can also control upgrades and operations in general. Fibers will still be cut, and equipment will fail, but the overall system remains quite robust.
> It also took years of operational improvements to get to this point. For much of the last decade, Google has improved its redundancy, its fault containment and, above all, its processes for evolution. We found that the network contributed less than 10% of Spanner’s already rare outages.
I am unpleasantly surprised that they do not mention why services that should be unrelated to S3 such as SES were impacted as well and what they are doing to reduce such dependencies.
From a software development perspective, it makes sense to reuse S3 and rely on it internally if you need object storage, but from an ops perspective, it means that S3 is now a single point of failure and that SES's reliability will always be capped by S3's reliability. From a customer perspective, the hard dependency between SES and S3 is not obvious and is disappointing.
The whole internet was talking about S3 when the AWS status dashboard did not show any outage, but very few people mentioned other services such as SES. Next time we encounter errors with SES, should we check for hints of S3 outage before everything else? Should we also check for EC2 outage?
> services that should be unrelated to S3 such as SES were impacted
I don't think this is particularly surprising. I'd already pretty much assumed that, e.g., a package of code for a Lambda function would be housed in an S3 bucket somewhere.
What's really surprising to me is how many of those buckets appear to live in US-EAST-1, and aren't able to keep functioning in a catastrophe by failing over to a different region.
I don't think it's automatic. I just helped my former boss with his decision to go for a refund (he asked me for help drafting a request, but I reminded him that like 99.99% of their S3 storage is backups that are IA-Standard, so it may not be worth it).
Meh, that's a process problem, not a people problem. Playbooks that have you retype commands with complex options, with no confirmation, etc, are inviting that sort of thing.
The wording of the article implies Amazon is shifting the blame entirely on the individual who typo'd: they indemnify themselves with "an authorized S3 team member using an established playbook..." ("don't blame us, our process is perfect!")
There are process-fixes for this, such as requiring a two-person rule when at a production shell and modifying tooling to detect potentially unintentional commands (e.g. a SQL UPDATE without a WHERE) - but given what I know about Amazon's internal practices (i.e. the brutality) it wouldn't surprise me if they did terminate the unfortunate operator - not because they want to, but because AWS simply has too many large-scale customers who would demand immediate action like that.
It's both. This isn't "our system was compromised by attack;" it's "SNAFU."
Everyone who's had operations experience knows that there will be, as time approaches infinity, more than zero SNAFU. That's why companies offer five nines of uptime, not 100% uptime.
I brought down our production system after a typo in a command once... the dev team took the blame for allowing an illegal parameter to bring down the system.
It was a very well-run engineering department where taking blame was not a career ending decision. I took full blame for the typo (at 3am trying to resolve a customer issue), but the dev team accepted full responsibility for letting it take down the system.
Every mistake was used as a learning opportunity to ensure that the same and similar mistakes can't be repeated.
He took off in his piston engine plane, only to lose power during the climb and was forced to make a crash landing. It turned out the airplane was fueled with jet fuel instead of regular gasoline (the ground crewman mistakenly thought the plane was a turbo prop).
Instead of yelling at or firing the ground crewman, Hoover had this to say[2]:
"There isn't a man alive who hasn't made a mistake.
But I'm positive you'll never make this mistake again.
That's why I want to make sure that you're the only one
to refuel my plane tomorrow. I won't let anyone else
on the field touch it."
On a more serious note, if you've never done something like this, you haven't had enough interesting projects.
I've had a decent career and I still managed to:
* re-deploy the current application version in all our data centers, instead of the new version, in a period when our deployment wasn't a 0-downtime one
* rename all the Jenkins jobs on the server to the same name, thus deleting hundreds of Jenkins jobs in one fell swoop
"Let him who is without sin cast the first stone" and all that :)
There are two things I've come to believe in my IT career regarding operations:
1. No organization anywhere is a paragon of excellence, and everyone can benefit from improvement.
2. Every organization is made up of humans just like you. With all that entails.
Some things which seem blatantly obvious after the fact are easily overlooked when the pressure to deliver is high and other issues are taking precedence.
I completely agree with your statement. In fact, when I do interviews, one of my favorite and most insightful questions to ask is, basically, "tell me about a time you screwed the pooch." If they don't have a story and they worked in ops, then it can suggest they didn't really do much. The really sharp ones I've interviewed have a good story or two (and can tell it in excruciating detail. =)
* At a prior company I once tried appending to the list of NFS exports, but dropped the "no-root-squash" option, and instantly denied write permissions to our entire VMware farm. You can imagine what then happened to all of the VMs for this mission critical customer. =P
As I keep saying, there are people who screw up big time and people who are too scared to touch the system. I managed to gun down three productive clusters by deploying NTP by accident along with a tiny change. Kerpow, 12 minutes of downtime, full network outage due to DHCP and such. Great fun.
"Seventy-three? Wow, I hadn't realized our system grew that much. Probably a new backend dependency got added that I'm not familiar with yet; I'll look into it later." (Y)
Every sysadmin at my previous job (a Fortune 500) would stop the moment that number was off from the expected by a few machines, just long enough to verify it's correct. That may be due to having made a mistake like this once. I know that's true in several other large shops, as well.
Source: I was the one they would call for our team... usually at 4:00 AM because one of our team members (who was also frequently me) didn't document something correctly.
If Amazon were a guy, he'd be a standup guy. This is a very detailed and responsible explanation. S3 has revolutionized my businesses and I love that service to no end. These problems happen very rarely, but I may add backups just in case, using an nginx proxy approach, at some point; because S3 is so good, everyone seems to adopt its API, so it's just a matter of a switch statement. Werner can sweat less. Props.
I would add, it would be awesome if there were a simulation environment, beyond just a test environment, that simulated outside servers making requests before a command was allowed to run on production; like a robot deciding this, it could mitigate issues like this, kind of like TDD on steroids, if they don't have that already.
I can imagine being that guy in that exact moment. But I can't imagine being that guy after the event. There will be a constant fear and doubt in my mind. And a constant fear whether others trust me anymore. I couldn't quit because that might make me look bad and I couldn't continue because that might make me look bad.
Twitter once had 2 hours of downtime because an operations engineer accidentally asked a tool to restart all memcached servers instead of a certain server. The tool was then changed to make sure that you couldn't restart more than a few servers without additional confirmation. Sounds very similar to this situation. Something to think about when you are building your tools to be more error proof.
> Removing a significant portion of the capacity caused each of these systems to require a full restart.
I'd be interested to understand why a cold restart was needed in the first place. That seems like kind of a big deal. I can understand many reasons why it might be necessary, but that seems like one of the issues that's important to address.
Possibly a consensus algorithm that refuses writes when it detects itself in a minority, because it thinks it's in the smaller part of a split-brain scenario.
In this case, throwing away and then re-provisioning the split-off nodes is a viable approach.
It sounds like this can be mitigated by making sure everything is run in dry run mode first, and for something mission critical, getting it double-checked by someone before removing the dry run constraint.
It's good practice in general, and I'm kind of astonished it's not part of the operational procedures in AWS, as this would have quickly been caught and fixed before ever going out to production.
I'm not sure you've understood what the problem is. They were removing some servers from a group. This isn't something that gets a dry run, not without spinning up the entire AWS infrastructure. It also wouldn't have helped a jot, since the issue came about after an employee executing a playbook made a typo.
There's no way this could have been mitigated with a dry run. They're mitigating it in future by putting more aggressive safeguards in their tooling, which is the correct way to mitigate this sort of issue.
"As a result, (personal experience and anecdotal evidence suggest that) for complex continuously available systems, Operations Error tends to be the weakest link in the uptime chain."
I guess it is time to define commands whose inputs have great distance in, say, the Damerau-Levenshtein metric.
For numerical inputs, one might use both the digits and the textual expression. This would make them quite cumbersome but much less prone to errors. Or devise some shorthand for them...
156 (on fi six). 35. (zer th fi). 170 (on se zer). 28 (two eig)
Evens have three letters, odds have two.
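A sketch of the edit-distance idea: a tool could refuse (or demand extra confirmation for) an argument that sits within a small restricted Damerau-Levenshtein (optimal string alignment) distance of a far more dangerous valid value. The threshold is arbitrary.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

# "10" and "100" are a single edit apart, which is exactly the kind of closeness
# a tool might flag before acting.
print(osa_distance("10", "100"))    # 1
print(osa_distance("10", "10000"))  # 3
```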
This reminds me of Asimov's characteristically tiny story "Fault-Intolerant" https://unotices.com/book.php?id=38686&page=15 (You can ignore the story at the top about Feghoot, the real story is below.)
Wonder if every number for critical command lines shouldn't be spelled out as well. If you think about how checks work, you're supposed to write the number as well as the words for the number.
-nbs two_hundreds instead of twenty
is much less likely to happen..
> While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.
This is the bit that'd worry me most; you'd think they'd be testing this.
A complete restart of the index subsystem would require downtime. Note: they are not saying those servers have never been restarted - it's highly likely they get restarted regularly. But, a complete restart of the index subsystem implies that you shut everything down first and restart it all at once, which is what was forced to happen two days ago.
That's kind of like asking why git doesn't have a single backup... it's a distributed system, there's not just one backup, there are lots of little partial backups.
This caused panic and chaos for a bit among my team, which I imagine was replicated across the web.
Moments like these always remind me that a particularly clever or nefarious set of individuals could shut down essential parts of the Internet with a few surgical incisions.
Seems like something like Chaos Monkey should have been able to predict and mitigate an issue like this. I'm actually curious if anyone uses it at all. Is anyone here at a large company (over 500 employees) that has it deployed?
I think they should have led with insensitivity about it and maybe a white lie. Such as... We took our main region us-east-1 down for X hours because we wanted to remind people they need to design for failure of a region :-)
Shameless plugs (authored months ago):
http://tuxlabs.com/?p=380 - How To: Maximize Availability Efficiently Using AWS Availability Zones (note: read it, it's not just about AZs; it is very clear about multi-region and, better yet, segues into multi-cloud in the second article)
http://tuxlabs.com/?p=430 - AWS, Google Cloud, Azure and the singularity of the future Internet
This makes me want to write a program that would ask users to confirm commands if it thinks they are running a known playbook and deviating from it. Does anyone know if a tool like that exists?
Not sure, but my company's fleet-wide root scripts first confirm the exact command you want to run, then run on one host and output the full logs for you to inspect and confirm, and then finally start the full fleet-wide run after you have confirmed the expected result of your output. They also output the full logs from across the entire fleet once your fleet-wide script is run.
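A sketch of that staged pattern; run_on_host and run_on_fleet are hypothetical helpers, not a real library:

```python
# Sketch of a staged fleet-wide run: echo the exact command, canary on one host,
# show its output, then require explicit confirmation before the full fleet.
def staged_fleet_run(command: str, hosts: list[str], run_on_host, run_on_fleet):
    print(f"About to run across {len(hosts)} hosts:\n  {command}")
    if input("Proceed with a single-host canary? [y/N] ").lower() != "y":
        return None

    canary_output = run_on_host(hosts[0], command)
    print(f"--- canary output from {hosts[0]} ---\n{canary_output}")

    if input("Canary looks good. Run on the full fleet? [y/N] ").lower() != "y":
        return None
    return run_on_fleet(hosts, command)
```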
For as much as people jumped all over Gitlab last month, this seems remarkably similar in terms of preparedness for accidental and unanticipated failure.
Beyond a typing mistake, it's not really very similar. The Gitlab incident was one avoidable problem after another, ending with a giant WTF when they found out that no-one had even tested the backups were working.
This is a case of someone slipping on the keyboard, removing more capacity than intended and the recovery process taking longer than expected. The process actually seems to be working (to a given value of working), but the amount of downtime was way above acceptable. They've already put more safeguards into the tooling to prevent the situation from happening again.
S3 is also orders of magnitude more complex than GitLab's infrastructure, so while the amount of time the outage lasted for is not acceptable, it does show that they at least have working processes for critical situations that allow them to get back in service within a day, which is pretty impressive.
I assume "reboot" in this instance means more than turning it off and on again--it must return to a working state, with many volumes of data requiring log processing to find the last (and best) "good state".
I don't think those would only have a mere 1 MB of data ;) Even considering RAM speeds, stuff scales in surprising ways when a third of the Internet relies on you. (Right, right, exaggerating - this was one AZ only; but you catch the drift, I assume)
No one on HN is questioning this - "The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected." - they were debugging on a production system...
What most AWS customers don't realize is that AWS is poorly automated. Their reliability relies on exploiting the employees to manually operate the systems. The technical bar at Amazon is incredibly low and they can't retain any good engineers.
What's missing is addressing the problems with their status page system, and how we all had to use Hacker News and other sources to confirm that US East was borked.
> We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions.
"From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3. Instead, we used the AWS Twitter feed (@AWSCloud) and SHD banner text to communicate status until we were able to update the individual services’ status on the SHD. We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions."
For the many of us who have built businesses dependent on S3, is anyone else surprised at a few assumptions embedded here?
* "authorized S3 team member" -- how did this team member acquire these elevated privs?
* Running playbooks is done by one member without a second set of eyes or approval?
* "we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years"
The good news:
* "The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately."
The truly embarrassing part, which everyone has known about for years, is the status page:
* "we were unable to update the individual services’ status on the AWS Service Health Dashboard "
When there is a wildly-popular Chrome plugin to fix your page ("Real AWS Status") you would think a company as responsive as AWS would have fixed this years ago.