Back in the days of MyISAM, before Google had their own ad network, I worked for the world's largest advertising network. It had a global reach of 75%, meaning three out of four people saw at least one of our ads daily.
I was trying to learn MySQL and the CTO made the mistake of giving me access to the prod database. This huge network that served most of the ads in the world ran off of only two huge servers running in an office outside Los Angeles.
MyISAM takes a table-level read lock on every SELECT query. I did not know this at the time. I was running a number of queries trying to pull historical performance data for all our ads across all time. They were taking a long time, so I let them run in the background while working on a spreadsheet somewhere else.
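(In hindsight the mechanics were simple. A rough sketch of the failure mode, with made-up table names, assuming the tables were MyISAM like the rest of the database:)

    -- Session 1: long analytics query. On MyISAM this holds a
    -- table-level read lock on `impressions` for its whole duration.
    SELECT ad_id, COUNT(*) FROM impressions GROUP BY ad_id;

    -- Session 2 (the ad servers): any write to the same table queues
    -- behind that read lock until session 1 finishes or is killed,
    -- and later reads then queue behind the waiting write.
    UPDATE impressions SET processed = 1 WHERE ad_id = 42;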
A little while later I hear some murmuring. Apparently the whole network was down. The engineering team was frantically trying to find the cause of the problem. Eventually, the CTO approaches my desk. "Were you running some queries on the database?" "Yes." "The query you ran was trying to generate billions of rows of results and locked up the entire database. Roughly three quarters of the ads in the world have been gone for almost two hours."
After the second time I did this, he showed me the MySQL EXPLAIN command and I finally twigged that some kinds of JOINs can go exponential.
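For the curious, this is the kind of thing EXPLAIN makes visible before you run anything. A simplified, hypothetical example (invented tables and columns):

    -- With no usable index on the join column, MySQL falls back to
    -- scanning one table once per row of the other.
    EXPLAIN
    SELECT a.campaign_id, i.shown_at
    FROM ads a
    JOIN impressions i ON i.ad_id = a.ad_id;
    -- "type: ALL" on both tables means full scans, and the "rows"
    -- estimates multiply: that product is the work the server is
    -- about to do, visible before a single row is fetched.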
Kudos to him for never revoking my access and letting me learn things the hard way. Also, if he worked for me I would have fired him.
Sounds like you appreciated that your boss gave you space to learn, and understood that you made an honest mistake, but you’d fire someone who made this mistake if they were working for you?
It's not good to punish people for making mistakes in the course of their work (especially if that work is meant to be educational)
It is good to punish people who give access to production databases to people who shouldn't have it. And the guy learning MySQL should not be given that access.
Taking down prod is always a symptom of a systemic failure. The person responsible for the systemic failure should see the consequences, not the person responsible for the symptom.
> The person responsible for the systemic failure should see the consequences
You don't see the contradiction in terms there? A systemic failure is by definition not the responsibility of one person. You're saying people should be able to make mistakes. But not those people.
Having worked in a very large, bureaucratic company, I can say that I strongly suspect just ignoring systemic failures as learning opportunities is also not sustainable. Too many times I’ve had to yield on something and say “I guess they’ll learn when this fails,” only to see them easily move on or get promoted before the failure occurs. They don’t learn their lessons.
I suspect the solution is to find a way to make sure the consequences of the decision are fed back properly into the system directed at the right person. How to do that, I have no idea.
It's not that you ignore them. But your first step if someone makes a mistake should not be to fire them. Maybe they need some coaching, maybe they have too much access or authority, but everyone makes mistakes. The key is whether or not they learn from them, or keep making them.
A piece of the system (a junior developer) is allowed to make mistakes. The person responsible for architecting and protecting the system (the CTO)... less so.
That depends; this might have been this CTO’s first time as CTO. Without knowing the story, they could well have been pretty far out of their element and just lucky to be a founder or something.
Even C-level people have to have their first day as C-level, and of course they will make mistakes.
The important thing is learning from them of course.
Well, one could argue that "giving access to production databases to people who shouldn't have it" was just "making mistakes in the course of their work" for this dude.
I’ve never understood the logic of firing someone over a mistake like that. They’re now the person least likely to make a similar mistake and they will maintain the institutional knowledge to help ensure it doesn’t happen again.
Easy: (1) He wasn't my boss. (2) He allowed a person not associated with his team or even the tech department to conduct potentially harmful operations on the production database without supervision. (3) Those actions resulted in millions of dollars of lost revenue and make-goods. (4) He did not coach the person who brought the database down. (5) He repeated the mistake.
How? AFAIK a single join can be at most quadratic, and multiple joins should be at most polynomial, where the exponent is the number of tables joined. To go exponential, you'd need some kind of recursion or self-reference, and I know of no way to express such a thing as an ordinary join statement.
(of course quadratic performance is already prohibitively slow on large tables, so there is no need to go exponential in order to take "forever")
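To make that concrete, a toy example (invented tables, each with n rows): the result of a k-table join is bounded by the product of the table sizes, so it grows polynomially in n and only blows up as you keep adding tables.

    -- Three tables of n rows each, joined on a key that barely
    -- restricts anything, can produce up to n * n * n rows.
    SELECT COUNT(*)
    FROM t1
    JOIN t2 ON t2.group_id = t1.group_id
    JOIN t3 ON t3.group_id = t1.group_id;
    -- Quadratic (two tables) is already enough to pin a large
    -- production database for hours; no true exponential needed.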
In our software, a minority codepath sometimes reported database deadlocks. Nothing critical, but it littered the ops error logs and probably displayed error messages to a few customers. So I added a pessimistic exclusive lock to a query, which basically solved the deadlock problem (not a great solution, but it worked). However, what I missed was that the query, even though it was in a minority code path, touched another table used in basically all hot-path queries. So the code seemed to work fine until it was deployed to all servers, at which point every operation across the whole cluster got serialized through this single lock.
So, yeah, database locks can bite you hard!
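Roughly the shape of it, with hypothetical table names (not the real queries):

    -- Intended fix for the rare deadlock: take the lock up front.
    START TRANSACTION;
    SELECT 1 FROM settings WHERE id = 1 FOR UPDATE;  -- exclusive row lock
    -- ... minority-path work ...
    COMMIT;

    -- The catch: if hot-path transactions also update (or lock) that
    -- same `settings` row, every one of them queues behind this lock,
    -- and throughput across the cluster collapses to one at a time.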
Not as bad as yours, but also MySQL and also blocking a prod table: on my first job after graduating, I once ran a delete command on about 20 rows of a quite large table (maybe 500M+ rows), and it caused deadlocks because of gap locks. It has been 6 years, so I don't really remember the details.
I was no expert, but I knew a bit about MySQL optimization at the time; it seems that sometimes you just do things without thinking them through.
15 minutes later, the sysadmin team PMed me asking WTF I was doing, and I realized what had happened.
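I don't remember the exact queries, but the general trap looks something like this (invented table, assuming an index on the range column and InnoDB's default REPEATABLE READ):

    -- Session 1
    START TRANSACTION;
    DELETE FROM events
     WHERE created_at BETWEEN '2019-01-01' AND '2019-01-02';
    -- Next-key locks now cover the scanned index range and its gaps.

    -- Session 2, concurrently
    INSERT INTO events (created_at, payload)
    VALUES ('2019-01-01 12:00:00', '...');
    -- Blocks on session 1's gap lock; several sessions doing this in
    -- different orders is a classic recipe for deadlocks.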
Someone hogging the database with an analytics query is an honest error because of an insidious footgun inherent in the technology stack. On the other hand, the CTO permitted access to the production database ... why? To learn MySQL, it would have been sufficient to set up a local instance, or to connect to testing/staging environments to get at some data.