You don't have to know exactly what would be a mistake. E.g. if the tool is used most of the time to operate on a small set of servers, require some extra confirmation or a command-line option for removing a large set.
That's good UI design in tools with powerful destructive capabilities. You make the UI for operating on lots of things different enough from the few things you do routinely that there's no mistaking one for the other.
Yes, but be careful. UIs like that tend to accumulate "--yes" options, because you don't feel like being asked every time for 1 server. Then one day you screw up the wildcard and it's 1000 servers, but you used the --yes template.
Which is why I'm pointing out that to design UIs like these you should fall back on slightly different UIs depending on the severity of the operation.
This is a good pattern to use. The more pre-feedback I get, the less likely I am to make a horrible mistake.
However one problem I often see with this pattern is the numbers are not formatted for humans to read. Suppose it prompts:
"1382345166 agents will be affected. Proceed? (y/n)"
Was that ~100M or ~1B agents? I can't tell unless I count the number of digits, which itself is slow and error-prone. It's worse if I'm in the middle of some high-pressure operation, because this verification detour will break my concentration and maybe I'll forget some important detail.
Now if the number is formatted for a human to consume, I don't have to break flow and am much less likely to make an "order-of-magnitude error":
"1,382,345,166 (1.4M) agents will be affected. Proceed? (y/n)"
I always attempt to build tooling & automation and use it during a project, rather than running lots of one-off commands. I find this usually saves me & my team a lot of time over the course of a project, and helps reduce the number of magical incantations I need to keep stored in my limited mental rolodex. I seem to have better outcomes than when I build automation as an afterthought.
I think it depends on the quality of the feedback. Most tooling sucks, so the messages are very literal trace statements peppered through the code, rather than telling you what the user-facing impact will be. When the thing is just spitting raw information at me, I'm probably going to train myself to ignore it. But if it can tell me what is going to happen, in terms that I care about, then I'll pay attention.
Imagine I just entered a command to remove too many servers that will cause an outage:
"Finished removing servers"
(better than no message, I suppose)
vs
"Finished removing 8 servers"
(better, it's still too late to prevent my mistake
but at least I can figure out the scale of my mistake)
vs
"8 servers will be removed. Press `y` to continue"
(better, no indication of impact but if I'm paying
attention I might catch the mistake)
vs
"40% capacity (8 servers) will be removed.
Load will increase by 66% on the remaining 12 servers.
This is above the safety threshold of a 20% increase.
You can override by entering `live dangerously`."
(preemptive safety check--imagine the text is also red so it stands out)
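To make that last variant concrete, here's a rough Python sketch of the kind of preemptive check I mean (the fleet size, the 20% threshold and the `live dangerously` override are just the numbers from the example above, not anyone's real tooling):

    def confirm_removal(total, to_remove, threshold=0.20):
        """Refuse to proceed quietly if removing servers pushes load growth past a threshold."""
        remaining = total - to_remove
        if remaining <= 0:
            print("Refusing: this would remove every server.")
            return False
        capacity_lost = to_remove / total      # e.g. 8/20 -> 40%
        load_increase = total / remaining - 1  # e.g. 20/12 - 1 -> ~67%
        print("{:.0%} capacity ({} servers) will be removed.".format(capacity_lost, to_remove))
        print("Load will increase by {:.0%} on the remaining {} servers.".format(load_increase, remaining))
        if load_increase > threshold:
            print("This is above the safety threshold of a {:.0%} increase.".format(threshold))
            return input("You can override by entering `live dangerously`: ") == "live dangerously"
        return input("Press `y` to continue: ").strip().lower() == "y"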
Obviously some UIs make some errors less likely. You don't have the "launch the nukes" button right next to the "make coffee" button, because humans are clumsy and don't pay attention.
Fat-finger implies you made your mistake once. A UI can't stop you from setting out to do the wrong thing, but it can make it astronomically unlikely to do a different action than the one you intended.
Simple example: I have a git hook which complains at me if I push to master. If I decide "screw you, I want to push to master", it can't assess my decision, but it easily fixes "oops, I thought I was on my branch".
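For reference, a pre-push hook can be any executable, so a Python version of that check might look roughly like this (git feeds the refs being pushed on stdin; the ALLOW_MASTER_PUSH escape hatch is my own convention, not necessarily what the parent uses):

    #!/usr/bin/env python3
    # .git/hooks/pre-push: refuse pushes that target master.
    # git passes lines of "<local ref> <local sha> <remote ref> <remote sha>" on stdin.
    import os
    import sys

    for line in sys.stdin:
        parts = line.split()
        if len(parts) == 4 and parts[2] == "refs/heads/master":
            if os.environ.get("ALLOW_MASTER_PUSH") == "1":  # the deliberate "screw you, I want master" path
                continue
            sys.stderr.write("Refusing to push to master. Set ALLOW_MASTER_PUSH=1 if you really mean it.\n")
            sys.exit(1)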
There's a balance to be struck. I'd say number of hoops you have to jump through to do something should scale with the potential impact of an operation.
That said, the only way to completely prevent mistakes is to make the tool unable to do anything at all.
(Or to encode every possible meaning of the word "mistake" in your software. If you could do that, you would probably get a Nobel prize for it.)
In a program I wrote, I make the user manually type "I AGREE" (case-sensitive) in a prompt before continuing, just to avoid situations where people tap "y" a bunch of times.
Habituation is a powerful thing: a safety-critical program used in the 90s had a similar, hard-coded safety prompt (<10 uppercase ASCII characters). Within a few weeks, all elevated permission users had the combination committed to muscle memory and would bang it out without hesitation, just by reflex: "Warning: please confirm these potentially unsaf-" "IAGREE!"
It's indeed a real problem. Hell, I myself am habituated to logins and passwords for frequently used dialog boxes, and so just two days ago I tried to log in to my work JIRA account using test credentials for an app we're developing...
For securing very dangerous commands, I'd recommend asking the user to retype a phrase composed of random words, or maybe a random 8-character hexadecimal number - something that's different every time, so can't be memorized.
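Something like this, say (Python sketch; the 8-character length is arbitrary):

    import secrets

    def confirm_dangerous(description):
        """Require the operator to retype a fresh random token, so it can't become muscle memory."""
        token = secrets.token_hex(4)  # 8 hex characters, different every time
        print(description)
        return input("Type {} to confirm: ".format(token)).strip() == token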
I think that even if someone can't memorize the exact characters, they'll memorize the task of retyping whatever characters are shown. Better would be to never ask for confirmation except in the worst of worst cases.
That's what I meant in my original comment when I wrote that "number of hoops you have to jump through to do something should scale with the potential impact of an operation". Harmless operations - no confirmation. Something that could mess up your work - y-or-n-p confirmation. Something that could fuck up the whole infrastructure - you'd better get ready to retype a mix of "I DO UNDERSTAND WHAT I'M JUST ABOUT TO DO" and some random hashes.
I almost deleted my Heroku production server even though you need to type (or copy-paste....ahem...) the full server name (e.g. thawing-temple-23345).
I think the reason was that, because in my mind I was 100% sure this was the right server, when the confirmation came up I didn't stop to check whether this really was the correct one. I mechanically started typing the name of the server, and just a second before I clicked OK, I had the genius idea to double-check.... Oh boy... My heart dropped to the floor when I realized what I was about to do.
You could say that indeed Heroku's system of avoiding errors worked correctly....
However, the confirmation dialog wasn't what made me stop... Instead it was my past self's experience screaming at me and reminding me of that ONE time I did fuck up a production server, years ago (it cost the company a full day of customers' bids... Imagine the shame of calling all the winning bidders and asking them what price they ended up bidding to win....)
My point is, maybe no number of confirmation dialogs, however complex they are, will stop mistakes if the operator is fixated on doing X. If you are working in semi-autopilot mode because you obviously are very smart and careful (ahem...), you will just do whatever the dialog asks you to do without actually thinking about what you are doing.
What, then, will make you stop and verify? My only guess is that experience is the only way. I.e. only when you seriously fuck up do you learn that no matter how many safety systems or complex confirmation dialogs there are, you still need to double- and triple-check each character you typed, unless you want to go through that bad experience again....
A well-designed confirmation doesn't give you the same prompt for deleting some random test server as it does for deleting a production server. That helps with the "autopilot mode" issue.
I agree that it should help reduce the number of mistakes.
But I still believe auto-pilot mode is a real thing (and a danger!).
My point is that I'm not sure if it's even possible to design one that actually cuts errors to 0.
And if that's indeed the case, even if it's close to 0, it's still non-zero, thus at the scale Amazon operates at, it's very probable that it will happen at least one time.
Maybe sometime in the future AI systems will help here?
I totally agree that it's a real issue, a danger, and that it's impossible to cut errors to zero.
I've also built complex systems that have been run in production for years with relatively few typo-related problems. The way I do it is with design patterns like the one I just mentioned, which is also what TeMPOraL was talking about (and I guess you missed it).
If you have the same kind of confirmation whenever you delete a thing, whether it's an important thing or not, you're designing a system which encourages bad auto-pilot habits.
You'll also note that Amazon's description of the way that they plan on changing their system is intended to fire extra confirmation only when it looks like the operator is about to make a massive mistake. That follows the design pattern I'm suggesting.
You could go further and try to prevent cat-on-the-keyboard mistakes, which is maybe what you're describing (solve this math equation to prove you are a human who is sufficiently not inebriated). Or even further and prevent malicious, trench-coat wearing, pointy-nosed trouble-makers.
The point is, yes, it is possible. That's what good design does.
It's not possible to be perfect, but you can certainly do better than taking down S3 because of a single command gone wrong.
One thing I have been doing for my own command-line tools is adding a preview mode that shows what a command will do, and making preview the default. It's simple, but if the S3 engineer had first seen a readout of the huge list of servers that were going to be taken offline instead of the small expected list, we probably would not be talking about this. There's obviously a ton more you can do here (have the tool throw up "are you sure" messages for unusual inputs, etc).
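The skeleton of that pattern is small. A Python sketch of a preview-by-default tool (the flag name and the remove_server stub are placeholders):

    import argparse

    def remove_server(name):
        """Placeholder for the real removal operation."""
        print("removing " + name)

    parser = argparse.ArgumentParser()
    parser.add_argument("servers", nargs="+")
    parser.add_argument("--apply", action="store_true", help="actually perform the removal")
    args = parser.parse_args()

    print("The following {} server(s) would be removed:".format(len(args.servers)))
    for s in args.servers:
        print("  " + s)

    if args.apply:
        for s in args.servers:
            remove_server(s)
    else:
        print("Dry run only. Re-run with --apply to proceed.")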
If the computer knows exactly what actions would be a mistake - how? The difference between correct and incorrect (not to mention legal and illegal) is usually inferred from a much wider context than what is accessible to a script. Mind you, in this specific case, Amazon even implies that such a command could have been correct under other circumstances.
So, this means a) strong superhuman AI (good luck), b) resolving an ambiguous input into one of the possibly-mistaken actions (good luck mapping all possible correct states), or c) a drool-proof interface ("It looks like you're trying to shut down S3, would you like some help with that?").
TL;DR: yes, but it's a cure worse than the disease.
Possibly the values were all within range. It was just that this operation was only supposed to work on a subset of the elements. No amount of validation will catch that error.
You could feed back a clarification, but if that happens too often, nobody will double-check it after they have seen it over and over.
While you can't prevent user error without preventing user capability, you can (as others have observed) follow some common heuristics to avoid common failure modes.
A confirm step in something as sensitive as this operation is important. It won't stop all user error, but it gives a user about to accidentally turn off the lights on US-EAST-1 an opportunity to realize that's what their command will do.
If you have a UI that allows you to undeploy 10 servers, it will also allow you to undeploy 100 servers. Unless you specifically thought about the possibility that there might need to be a lower bound on the number of servers, which they obviously hadn't before this. It's easy to talk about it after the fact, but nobody is able to predict all such scenarios in advance - there are just too many ways to mess up to have special code for all of them.
The tool as a whole should incorporate a model of S3. Any action you take through the UI should first be applied to this model, and then the resulting impact analyzed. If the impact is "service goes down", then don't apply the action without raising red flags.
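Roughly what I have in mind, as a toy Python sketch (the Fleet fields and the minimum-capacity number are invented for illustration; a real model of S3 would obviously be far richer):

    from dataclasses import dataclass, replace

    @dataclass(frozen=True)
    class Fleet:
        servers: int
        min_servers: int  # minimal requirement for the service to stay up

    def apply_removal(fleet, count):
        """Apply the action to the model only; nothing real is touched here."""
        return replace(fleet, servers=fleet.servers - count)

    def analyze(fleet):
        """Return a red flag if the modelled end state means the service goes down."""
        if fleet.servers < fleet.min_servers:
            return "service goes down: {} servers left, {} required".format(fleet.servers, fleet.min_servers)
        return None

    model = Fleet(servers=20, min_servers=15)
    flag = analyze(apply_removal(model, 8))
    if flag:
        print("Refusing without an explicit override: " + flag)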
Where I work we use PCS for high availability, and it bugs the heck out of me that a fat-fingered command can bring down a service. PCS knows what the effect of any given command will be, but there's no way (that I know of) to do a "dry run" to see whether your services would remain up afterward.
In practice, it would likely be very hard to make a model of your infrastructure to test against, but I can imagine a tool that would run each query against a set of heuristics, and if any flags pop up, it would make you jump through some hoops to confirm. Such a tool should NEVER have an option to silently confirm, and the only way to adjust a heuristic if it becomes invalid should be formally getting someone from an appropriate department to change it and sign off on it.
By the way, this is how companies acquire red tape. It's like scar tissue.
They probably didn't know the service would go down. For that, you need to identify the minimal requirements for the service to be up, and code those requirements into the UI, upfront. Most tools don't do that. File managers don't check that the file you delete isn't necessary for any of the installed software packages to run. Shells don't check that the file you're overwriting isn't a vital config file. Firewall UIs don't check that the port you're closing isn't vital for some infrastructural service. It would be nice to have a benevolent, omniscient, God-like UI with the foresight to check such things - but usually the way it works is that you learn about these things after the first (if you're lucky) time it breaks.
Or it makes you re-enter the quantity of affected targets as a confirmation, similar to the way GitHub requires a second entry of a repo name for deletion.
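A minimal version of that check (Python sketch, names invented):

    def confirm_count(affected):
        """Make the operator retype the number of affected targets, GitHub-style."""
        typed = input("{} targets will be affected. Retype that number to confirm: ".format(affected))
        return typed.strip() == str(affected)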
Agreed -- a postmortem should cite that deployment goof as the immediate cause, with a contributory cause of "you can goof like this without getting a warning etc".