> Removing a significant portion of the capacity caused each of these systems to require a full restart.
I'd be interested to understand why a cold restart was needed in the first place. That seems like kind of a big deal. I can understand many reasons why it might be necessary, but that seems like one of the issues that's important to address.
Possibly a consensus algorithm that refuses writes when it detects itself in a minority, because it thinks it's in the smaller part of a split-brain scenario.
In this case, throwing away and then re-provisioning the split-off nodes is a viable approach.
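To make that guess concrete, here is a rough sketch of a majority check, not what the affected systems actually run. The names `Node`, `has_quorum`, `NoQuorumError`, and the fields are all made up for illustration, and I'm assuming a fixed, known cluster size.

```python
from dataclasses import dataclass, field


class NoQuorumError(Exception):
    """Raised when a node cannot see a strict majority of the cluster."""


@dataclass
class Node:
    cluster_size: int        # configured total number of members (assumed fixed)
    reachable_peers: int     # peers this node can currently reach, excluding itself
    log: list = field(default_factory=list)

    def has_quorum(self) -> bool:
        # Count this node plus its reachable peers and require a strict majority.
        return (self.reachable_peers + 1) > self.cluster_size // 2

    def write(self, entry: str) -> None:
        # A node in the minority refuses the write rather than risk diverging
        # from the larger side of the partition.
        if not self.has_quorum():
            raise NoQuorumError("minority partition: refusing write")
        self.log.append(entry)


if __name__ == "__main__":
    majority_side = Node(cluster_size=5, reachable_peers=2)
    minority_side = Node(cluster_size=5, reachable_peers=1)
    majority_side.write("x=1")
    try:
        minority_side.write("x=2")
    except NoQuorumError as err:
        print(err)
```

Note that if enough capacity disappears at once, no remaining group reaches a majority of the configured size, so every survivor refuses writes until membership is reset. That would be consistent with needing a full restart, though it's still only a guess.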