You have to be able to take the very demanding step of saying, "No. We're not doing that. I don't care if this change caused a SEV, we're not adding process. Moving fast is important!"
The moment that some minor disaster hits, which it inevitably will, whoever is in charge will find themselves no longer in charge.
I think this is one of the fatal flaws of a big company. Everyone is terrified of risk. I'm not sure it's possible to counteract this tendency unless the company was built from the ground up not to punish risky behavior (e.g. Facebook, supposedly).
Right. You have to create a culture where it's okay to fail. And it's not enough to just say that: you have to actually practice it. You have to let people make mistakes, then do nothing so that other people don't feel afraid to take risks. Hell, reward people for failing. Highlight them as exemplars of people who do things. If you don't fail once in a while, you're not moving fast enough.
Rewarding failure can have pitfalls (if we mean something like customer-impacting fallout). You may then be favoring people who create broken things over people who more quietly create well-functioning things the first time around.
Enabling people willing to act is important though, and I fully support the "do nothing" response to failure, or better yet, "support and stabilize, but don't reward".
> Enabling people willing to act is important though
I think that's the key thing here. Rather than saying "whether you succeed or fail, you're still gonna get your $250 bonus", you're saying "if you try and succeed, you get $500; if you try and fail, you get $100; if you don't try at all, you get nothing."
Yeah, but this is at odds with virtually every big company. No one has the power to effect that kind of change except the CEO, and the CEO is rarely interested or involved enough to do that. Is there any way around this problem?
If your CTO or VP of engineering or VP of ops can't make this change, your company likely has other problems.
I admit to inheriting a technical culture that was already fairly accepting of smart risks, but I also worked to extend and cement that by holding regular blame-free post-mortems on production problems and reporting them weekly to our business operations meeting. Making failure, and the analysis and correction thereof, a regular part of company operations makes it normal, accepted, and less scary. We would also pretty regularly respond to asteroid-type production problems with "no preventative action intended; cost of prevention exceeds expected losses". (We held the view that process comes in two conceptual kinds, green tape and red tape: if you're fixing problems with process, make sure it's actually green tape. If fixing a problem required adding red tape to the system, we were very skeptical and tended to avoid adding that process/step/gate/check.)
Couple that mindset and transparency with a metrics-supported track record of making things better overall and management will support you. I'm not sure our CEO ever saw any of that sausage-making, except for the very largest or most damaging problems, and even then it was mostly a courtesy message to him. He cared about the overall pace and metrics, not how many outages or bugs we had along the way.
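To make the "cost of prevention exceeds expected losses" call concrete, here's a rough sketch of the comparison in Python. The function names and all of the numbers are made up for illustration; they are not from the team described above, and a real decision would weigh more than a single expected-value figure.

```python
def expected_annual_loss(probability_per_year: float, cost_per_incident: float) -> float:
    """Expected yearly loss from one kind of incident (hypothetical model)."""
    return probability_per_year * cost_per_incident


def should_add_prevention(
    probability_per_year: float,
    cost_per_incident: float,
    annual_prevention_cost: float,
) -> bool:
    """Add process only if it costs less per year than the loss it is meant to prevent."""
    loss = expected_annual_loss(probability_per_year, cost_per_incident)
    return annual_prevention_cost < loss


# Illustrative "asteroid" case: rare and moderately costly, but expensive to guard against.
asteroid = should_add_prevention(
    probability_per_year=0.02,      # assumed: roughly once in 50 years
    cost_per_incident=50_000,       # assumed dollar impact per occurrence
    annual_prevention_cost=20_000,  # assumed yearly cost of the extra gate/check
)

# Illustrative frequent case: likely enough that a cheap safeguard pays for itself.
frequent = should_add_prevention(
    probability_per_year=0.5,
    cost_per_incident=50_000,
    annual_prevention_cost=5_000,
)
```

With these assumed inputs the asteroid case comes out as "no preventative action", while the frequent case justifies the safeguard.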