
It's not always as simple as that. What if the problem was that something in a change didn't behave as specified and wound up writing important data in an incorrect but retrievable format? The rolled-back version might not recognise that data properly and could end up either modifying it further, so the true data could no longer be retrieved, or causing data loss elsewhere as a consequence.



In that case you would probably still roll back to prevent further data corruption and restore the corrupted records from backups.

There are certainly changes that cannot be rolled back such that the affected users are magically fixed, which is not what I am suggesting. In the context of mission critical systems, mitigation is usually strongly preferred. For example, the Google SRE book says the following:

> Your first response in a major outage may be to start troubleshooting and try to find a root cause as quickly as possible. Ignore that instinct!

> Instead, your course of action should be to make the system work as well as it can under the circumstances. This may entail emergency options, such as diverting traffic from a broken cluster to others that are still working, dropping traffic wholesale to prevent a cascading failure, or disabling subsystems to lighten the load. Stopping the bleeding should be your first priority; you aren’t helping your users if the system dies while you’re root-causing. [...] The highest priority is to resolve the issue at hand quickly.

I have seen too many incidents (one in the last two days, in fact) that were prolonged because people dismissed a blind rollback of recent changes, merely because they thought those changes were not the root cause.


> In that case you would probably still roll back to prevent further data corruption and restore the corrupted records from backups.

OK, but then what if it's new data being stored in real time, so there isn't any previous backup with the data in the intended form? In this case, we're talking about Stripe, which is presumably processing a high volume of financial transactions even in just a few minutes. Obviously there is no good option if your choice is between preventing some or all of your new transactions or losing data about some of your previous transactions, but it doesn't seem unreasonable to do at least some cursory checking about whether you're about to cause the latter effect before you roll back.


I think you guys are considering this from the wrong angle...

Rollbacks should always be safe. They should always be automatically tested. So a software release should do a gradual rollout (e.g. 1, 10, 100, 1000 servers), but it should also restart a few servers with the old software version just to check that a rollback still works.

The rollout should fail if health checks (including business metrics like conversion rates) fail on either the new release or the old release.

If only the new release fails, a rollback should be initiated automatically.

If only the old release fails, the system is in a fragile but still working state for a human to decide what to do.
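
To make that concrete, here is a minimal sketch of that flow in Python. The operations passed in as callables (deploy_new, restart_old, new_healthy, old_healthy, roll_back_all) and the stage sizes are illustrative placeholders for whatever your deployment tooling actually provides, not any particular platform's API:

    # Minimal sketch of the gradual rollout / automatic rollback flow described above.
    # The four callables are assumed hooks into your deployment tooling.
    STAGES = [1, 10, 100, 1000]  # servers running the new version at each stage

    def gradual_rollout(deploy_new, restart_old, new_healthy, old_healthy, roll_back_all):
        """deploy_new(n): put the new build on n servers.
        restart_old(n): restart n servers on the old build to prove rollback still works.
        new_healthy()/old_healthy(): health checks, including business metrics
        such as conversion rate, not just liveness.
        roll_back_all(): revert every server to the old build."""
        for count in STAGES:
            deploy_new(count)
            restart_old(2)                    # keep exercising the rollback path
            if not new_healthy():
                roll_back_all()               # only the new release is unhealthy:
                return "rolled back automatically"
            if not old_healthy():
                return "paused: rollback path unhealthy, human decides"  # fragile but working
        return "rollout complete"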


This is one of those ideas that looks simple enough until you actually have to do it, and then you realise all the problems with it.

For example, in order to avoid any possibility of data loss at all using such a system, you need to continue running all of your transactions through the previous version of your system as well as the new version until you're happy that the performance of the new version is satisfactory. In the event of any divergence you probably need to keep the output of the previous version but also report the anomaly to whoever should investigate it.
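
As a rough illustration of what that dual-running looks like, here is a sketch in which the old version stays authoritative, the new version runs in shadow, and divergences are reported rather than acted on. The process_old, process_new and report_anomaly callables are hypothetical stand-ins:

    # Sketch of running every transaction through both versions: keep the old
    # (trusted) result, flag any divergence for a human to investigate.
    def handle_transaction(txn, process_old, process_new, report_anomaly):
        old_result = process_old(txn)             # previous version remains the source of truth
        try:
            new_result = process_new(txn)         # new version sees the same input, in shadow
            if new_result != old_result:
                report_anomaly(txn, old_result, new_result)   # investigated later
        except Exception as exc:
            report_anomaly(txn, old_result, exc)  # the new version crashing is also a divergence
        return old_result                         # callers only ever see the old version's output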

But then if you're monitoring your production system, how do you decide that the new version's performance is acceptable? If you're looking at metrics like conversion rates, you're going to need a certain amount of time to get a statistically significant result if anything has broken. Depending on your system and what constitutes a conversion, that might take seconds or it might take days. And during that whole time you can only make a single change, so that it can be rolled back to exactly the previous version without any confounding factors.
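
To put rough numbers on that, the standard two-proportion sample-size formula gives a feel for how long you'd have to wait. The 2% baseline conversion rate and the 10% relative drop below are made-up figures purely for illustration:

    from math import ceil

    # Rough sample size to detect a drop in conversion rate
    # (two-proportion test, 5% significance, 80% power).
    z_alpha, z_beta = 1.96, 0.84      # z-scores for alpha=0.05 (two-sided) and power=0.80
    p1 = 0.02                         # assumed baseline conversion rate
    p2 = p1 * 0.9                     # detecting a 10% relative drop

    n = ceil((z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2)
    print(n)                          # roughly 73,000 visitors per arm

At a 2% baseline that's over 70,000 visitors per arm before the test can tell you much either way, which is why the answer ranges from seconds to days depending on your traffic.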

And even if you provide a doubled-up set of resources to run new versions in parallel and you insist on only rolling out a single change to your entire system during a period of time that might last for days in case extended use demonstrates a problem that should trigger an automatic rollback, you're still only protecting yourself against problems that would show up in whatever metric(s) you chose to monitor. The real horror stories are very often the result of failure modes that no-one anticipated or tried to guard against.


I think the 80/20 rule applies here.


My point was that it's all but impossible for any rollback to be entirely risk-free in this sort of situation. If everything were understood well enough, and working to spec well enough, for that to be possible, you wouldn't be in a situation where you had to decide whether to make a quick rollback in the first place.

I'm not saying that the decision won't be to do the rollback much of the time. I'm just saying it's unlikely to be entirely without risk and so there is a decision to be considered. Rolling back on autopilot is probably a bad idea no matter how good a change management process you might use, unless perhaps we're talking about some sort of automatic mechanism that could do so almost immediately, before there was enough time for significant amounts of data to be accumulated and then potentially lost by the rollback.



