
Sent in a SQL command deleting several million records (intended) wrapped in a single transaction (not intended). The replication queues could not keep up and failed, bringing down most of the replicas. The master server kept trying to recover and maxed out all connections, so no DBA could log in to perform manual recovery. We had to hard reboot without knowing what state the system was in or how long it would take to fully recover. Did I mention that this was a few hours before trading was going to begin?

TBH, my team was very gracious about it and the RCA focused purely on the events that occurred and how to never let it happen again. No blame game at all.
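
For anyone wondering what the non-single-transaction version of that delete might look like, here is a rough sketch, not what we actually ran. It uses Python's built-in sqlite3 purely so it is runnable; the `trades` table, the `expired` column, and the batch size are all made up for illustration. The idea is just that each batch commits on its own, so no single transaction ever covers millions of rows.

```python
import sqlite3


def delete_in_batches(conn, batch_size=1000):
    """Delete flagged rows in small transactions; return total rows deleted.

    The table and column names are hypothetical, chosen for the sketch.
    """
    total = 0
    while True:
        # "with conn" commits on success and rolls back on an exception,
        # so every batch is its own short transaction.
        with conn:
            cur = conn.execute(
                "DELETE FROM trades WHERE id IN "
                "(SELECT id FROM trades WHERE expired = 1 LIMIT ?)",
                (batch_size,),
            )
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


def demo():
    """Build a throwaway in-memory table and purge half of it in batches."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, expired INTEGER)")
    with conn:
        conn.executemany(
            "INSERT INTO trades (expired) VALUES (?)",
            [(i % 2,) for i in range(10_000)],
        )
    deleted = delete_in_batches(conn, batch_size=1000)
    remaining = conn.execute("SELECT COUNT(*) FROM trades").fetchone()[0]
    conn.close()
    return deleted, remaining
```

On a replicated production database you would probably also want a pause between batches, or a check on replica lag, so the replicas can keep up. How to do that depends on the engine, so it is left out here.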



> TBH, my team was very gracious about it and the RCA focused purely on the events that occurred and how to never let it happen again. No blame game at all.

Which is how a PIR, PER or PCR should be. If you don't understand why someone makes a mistake, you can't avoid future mistakes.


I understand SQL, DBA and TBH, but what do RCA, PIR, PER, and PCR stand for?


RCA is "Root Cause Analysis" and I assume PIR is "Post Incident Review". I don't know PER or PCR.


"Post Event Review" and "Post Change Review".


Hmmm. I can't help but wonder if maybe the database engine should have caught the queue overflow (even if the event occurred over the network) and failed the transaction.



