Sent in a SQL command deleting several million records (intended) wrapped in a single transaction (not intended). The replication queues could not keep up and failed, bringing down most of the replicas. The master server kept trying to recover and maxed out all connections, so no DBA could log in to perform manual recovery. We had to hard reboot, not knowing what state the system was in or how long it would take to fully recover. Did I mention that this was a few hours before trading was going to begin?
TBH, my team was very gracious about it and the RCA focused purely on the events that occurred and how to never let it happen again. No blame game at all.
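For anyone wondering what the "intended" version could look like: the usual way to avoid one giant transaction is to delete in small batches and commit after each one, so replication only ever has to ship a small change at a time. Below is a minimal sketch of that idea, not what we actually ran. It uses Python's built-in sqlite3 as a stand-in for the real database, and the function name `delete_in_batches`, its parameters, and the `rowid`-based subquery are all assumptions for illustration; other engines have their own syntax for limiting a delete.

```python
import sqlite3
import time


def delete_in_batches(conn, table, where_clause, params=(),
                      batch_size=10_000, pause_seconds=0.0):
    """Delete matching rows in many small transactions instead of one huge one.

    `table` and `where_clause` are interpolated into the SQL, so they must be
    trusted, hard-coded strings (never user input). Values go in `params`.
    Returns the total number of rows deleted.
    """
    total = 0
    while True:
        # `with conn` commits on success and rolls back on an exception,
        # so each batch is its own short transaction.
        with conn:
            cur = conn.execute(
                f"DELETE FROM {table} WHERE rowid IN "
                f"(SELECT rowid FROM {table} WHERE {where_clause} LIMIT ?)",
                (*params, batch_size),
            )
        if cur.rowcount == 0:
            break  # nothing left to delete
        total += cur.rowcount
        if pause_seconds:
            # Hypothetical breathing room so downstream consumers can catch up.
            time.sleep(pause_seconds)
    return total
```

A call would look something like `delete_in_batches(conn, "trades", "stale = ?", (1,), batch_size=10_000)`, where `trades` and `stale` are made-up names. The trade-off is that the delete is no longer atomic: if it fails halfway, the batches already committed stay deleted, so the condition should be safe to re-run.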
> TBH, my team was very gracious about it and the RCA focused purely on the events that occurred and how to never let it happen again. No blame game at all.
Which is how a PIR, PER or PCR should be. If you don't understand why someone makes a mistake, you can't avoid future mistakes.
Hmmm. I can't help but wonder if maybe the database engine should have caught the queue overflow (even if the event occurred over the network) and failed the transaction.