The Day the Replication Died (kickstarter.com)
99 points by mecredis on April 10, 2013 | 55 comments



Nice writeup. I've been wrestling with MySQL replication for a few months, too. I have a multi-master setup with an additional slave off-site that stores backups.

I've switched entirely to row-based replication because it offers one really nice feature: in the event of a desync, I can resync the slaves with a query like this on the master,

    create table new_table like bad_table;
    insert into new_table select * from bad_table;
    drop table bad_table; rename table new_table to bad_table;
...and bam, everything's groovy. On a VPS, I can have the slaves resync'd within a few minutes this way on tables with > 1m rows.

The other pretty important thing to have is fast notifications of trouble. It sucks waiting for the slaves to catch up on the binlog from the master. I wrote a shell script that runs from cron every few minutes, checks the replication status on all hosts, and lets me know if there's trouble.

Being a few days behind in replication can take hours to catch up; being a few hours behind only takes a few minutes to catch up.

I'm really looking forward to trying out Percona XtraDB Cluster and seeing if it handles any of this stuff better than stock MySQL.


Yes, that should also work for Mixed replication.

The rule for mixed replication is that statement-based replication is used UNLESS the statement is non-deterministic. E.g., if you are performing an update on a random row ID, then the change is logged as row events and applied that way on the slaves.
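
Roughly, here's a sketch of how you can watch that decision being made (the database, table, and binlog file names are made up):

    # With MIXED format, deterministic statements stay statement-based...
    mysql -e "SET GLOBAL binlog_format = 'MIXED';"
    mysql mydb -e "UPDATE backers SET note = 'thanks' WHERE id = 42;"

    # ...but a statement MySQL flags as unsafe (RAND(), UUID(), etc.) is
    # written to the binlog as row events instead.
    mysql mydb -e "UPDATE backers SET flagged = 1 WHERE id = FLOOR(1 + RAND() * 1000);"

    # Inspect what actually got logged:
    mysqlbinlog --verbose /var/lib/mysql/mysql-bin.000123 | less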


Could you explain why that technique is only viable for row-based replication?

It looks really useful and I don't quite understand enough about the replication process to see why you couldn't rely on this for mixed replication as well.


It might work for mixed too, but I'm not sure. Honestly, I read http://dev.mysql.com/doc/refman/5.6/en/binary-log-mixed.html a few times and my eyes glazed over by the end every time.

Finally I just decided that I could expect it to work forever with row-based, but I was less certain about whether it would work in mixed, or whether it might change in future versions of MySQL. I consider "I'm not sure what will happen" to be a scary answer in sysops, so I went with row-based.


This should work in a mixed setup too; remember, replication in MySQL is pretty much a replay of SQL statements.


I don't know how mixed mode decides between row-based and statement-based for a given query, but if it chose statement-based, the queries would run fine and you still wouldn't be resynchronized.

(In statement-based mode, I've done resyncs by iterating over each row, setting a not-super-important column to a different value and then back, and including literal values for all the columns that need to be resynchronized. It's a lot more queries, but it's also easier to rate-limit if your tables are big enough that doing it in two queries would back everything up.)
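
For concreteness, a rough sketch of that loop (table, column, and connection details are invented; real string values would also need proper quoting):

    #!/bin/sh
    # Walk the table row by row; each UPDATE carries the master's literal
    # values, so replaying the statement on the slave overwrites whatever
    # drifted there. The flag toggle guarantees the row actually changes,
    # and the sleep is the rate limit.
    mysql -N mydb -e "SELECT id, col_a, col_b FROM bad_table ORDER BY id" |
    while read -r id col_a col_b; do
        mysql mydb -e "UPDATE bad_table
                       SET col_a = '$col_a', col_b = '$col_b',
                           touch_flag = 1 - touch_flag
                       WHERE id = $id"
        sleep 1   # crude, but easy to tune
    done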


I'd love to see your shell script for checking a slave status.



Sure. It was a little fiddly to set up but nothing too fancy:

    #!/bin/sh
    
    # MySQL Status check module
    
    # Retrieve mysql status from target server
    statusinfo=`sudo -u tron ssh "tron@$server" "sudo tron mysqlpass \"$mysqlpass\" mysqlstatus"`
    
    # Status of Slave_IO_Running
    status_io=`echo "$statusinfo" | grep -o -e "Slave_IO_Running:[[:space:]]*[[:alpha:]]*" | cut -d ' ' -f 2`
    
    # Status of Slave_SQL_Running
    status_sql=`echo "$statusinfo" | grep -o -e "Slave_SQL_Running:[[:space:]]*[[:alpha:]]*" | cut -d ' ' -f 2`
    
    # Seconds behind master
    status_sec=`echo "$statusinfo" | grep -o -e "Seconds_Behind_Master:[[:space:]]*[[:alnum:]]*" | cut -d ' ' -f 2`
    
    # Verify status, send notification if it's bad.
    status_ok=1
    
    if [ "$status_io" != "Yes" ]; then
        status_ok=0
        echo "Slave_IO_Running on $server: $status_io\n\nSlave status for $server:\n$statusinfo" | mail -s "[Tron] ALERT: Slave_IO_Running on $server" rob@associatedtechs.com
    fi
    
    if [ "$status_sql" != "Yes" ]; then
        status_ok=0
        echo "Slave_SQL_Running on $server: $status_sql\n\nSlave status for $server:\n$statusinfo" | mail -s "[Tron] ALERT: Slave_SQL_Running on $server" rob@associatedtechs.com
    fi
    
    if [ "$status_sec" -gt "300" ]; then
        status_ok=0
        echo "Seconds_Behind_Master on $server: $status_sec\n\nSlave status for $server:\n$statusinfo" | mail -s "[Tron] ALERT: Seconds_Behind_Master on $server" rob@associatedtechs.com
    fi
    
    if [ "$status_ok" -eq "1" ]; then
        echo "OK"
    fi
This is a module from a set of scripts that I collectively refer to as "tron" (they "fight for my users" :-). The "mysqlstatus" command on the remote server gets routed to a tron client that ends up running,

    mysql_status ()
    {
        echo "show master status\G; show slave status \G" | mysql -u root -p"$mysqlpass"
    }
It's probably a bad idea to pass the mysql root user password as a parameter in a command (you have to be certain it's not getting logged anywhere), so don't do that if you decide to make something similar. I prefer it to having a readable file on the server with the password, but it would probably be better yet to set up a specific MySQL user with limited access to specific stuff. It's on my to-do list.


You can use a .my.cnf in the monitoring user's home directory to hold the login and password for all of your mysql actions.
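
Something like this, for example (the file path, user name, and privileges are just an illustration):

    # Client options file for the monitoring user; keeps the password off
    # the command line and out of shell history entirely.
    cat > /home/tron/.my.cnf <<'EOF'
    [client]
    user = monitor
    password = s3cret-goes-here
    EOF
    chmod 600 /home/tron/.my.cnf

    # The monitoring account only needs something like:
    #   GRANT REPLICATION CLIENT ON *.* TO 'monitor'@'localhost';
    mysql -e "SHOW SLAVE STATUS\G"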


Oh! Excellent. I didn't know that. Thank you.


Can someone point to a big postgres replication failure event? I know mysql is getting better, but I still see public failures like this much more frequently than big postgres (or Oracle) failures.


PostgreSQL's replication is essentially the equivalent of MySQL's ROW-based format (that is, it replicates the actual table changes instead of the statements). If you used purely row-based replication, you likely wouldn't see this in MySQL either. Of course, since TANSTAAFL, your network traffic between nodes is much higher, and the disk storage requirements grow significantly (for the binary logs).

Also, RDS doesn't offer anything but mixed format replication for slaves, and DRBD for HA.

The best middle ground when ROW based replication isn't an option is to check for inconsistencies periodically using something like pt-table-checksum, and fix them when they're found with something like pt-table-sync.
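
A rough sketch of that middle ground (host names and credentials are placeholders):

    # Run on the master: writes per-chunk checksums into percona.checksums,
    # which replicate to the slaves, so drift shows up as checksum diffs.
    pt-table-checksum --replicate=percona.checksums h=master.example.com,u=checksum_user,p=secret

    # Review the statements that would repair the slaves, then apply them.
    pt-table-sync --replicate=percona.checksums --print   h=master.example.com,u=checksum_user,p=secret
    pt-table-sync --replicate=percona.checksums --execute h=master.example.com,u=checksum_user,p=secret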

[EDIT] Can you please explain the downvotes?


I don't know about the downvotes, but using anything non-default in MySQL is dangerous because the devs don't test well enough.

I've had enough troubles just with transactions to never consider MySQL for a serious project ever again.


> I've had enough troubles just with transactions

What kind of troubles? It's a serious question; I work with InnoDB transactions on a daily basis, and if I can find more real issues, I can usually get them resolved (or explain the behavior).

MySQL is a very powerful and performant DB (and, contrary to seeming popular opinion, 100% production ready), but it's not always obvious why things work the way they do.


"Getting better" isn't entirely relevant to this specific case, as the relevant bug was introduced in 5.5.6. Personally, I've had quite a few horrible experiences with mysql replication interacting poorly with various mysql features, and I've lost more time than I'd care to admit trying to figure out workarounds. MySQL wasn't designed with data integrity in mind, and that's had significant influence on its development and ecosystem.

Answering your actual question, no, I've never heard of a big postgres replication failure event, but that's also plausibly explained by postgresql's smaller market share and visibility.


It may also have something to do with the fact that the first version of PostgreSQL to include replication did not emerge until late 2010. Prior to 9.0, it was logically impossible for PostgreSQL itself to have a "replication failure".


Yes, that's certainly relevant as well, although I disagree with some of the implications of "logically impossible", given the third-party postgresql replication products, though I don't have personal experience with them.


I doubt you will see anything similar with Postgres, simply because it's too strict and conservative about how it replicates.

Postgres replicates transactions rather than rows or statements. Postgres uses a transaction log, aka xlog (also called the write-ahead log, or WAL), and this is streamed to slaves where the xlog entries are replayed. It's analogous to Oracle's redo log. Each log entry is information about what tuples to update (or insert, or delete), and what to update with.

This means that the replication data flow is basically the same as the data flow that occurs on a single server when clients execute SQL. Something like "update foo set a = 1" will result in a log entry that describes how the physical database files are updated; therefore, replication is a matter of applying the same log entries on the slaves as were applied on the master. (This system is also used to implement streaming backups: You can reconstruct the database at any point in time simply by replaying old transaction logs up to the point in time that you want.)
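
You can watch that stream from either end, for example (9.1+; column names vary a bit between versions):

    # On the master: one row per connected standby.
    psql -c "SELECT client_addr, state, sent_location, replay_location FROM pg_stat_replication;"

    # On a standby: a rough measure of replay lag.
    psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;"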

Everything that involves state change is encapsulated by transaction logging. Postgres has transactional DDL -- in other words, you can do things like "create table", "alter table", "drop table" etc. in transactions -- precisely thanks to this symmetry. It also means that the replication state is unaffected by context: Things like sequences and timestamps are made consistent because they are, by necessity, already calculated by the time the transaction log is written.
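
A quick illustration of the transactional DDL point (the database and table names are examples):

    psql mydb <<'SQL'
    BEGIN;
    CREATE TABLE replication_demo (id serial PRIMARY KEY, note text);
    ALTER TABLE replication_demo ADD COLUMN created_at timestamptz;
    ROLLBACK;  -- the table never existed, on the master or on any standby
    SQL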

So with Postgres' replication it's virtually impossible to end up in a situation like Kickstarter's, where duplicate IDs are replicated. Of course, physical corruption or some weird bug could cause this, but at least the latter is very unlikely, and if it did happen it would very likely hit more than just replication, again because of how the xlog is so central to the entire system. In other words, an xlog bug that only happened to replication would be fairly rare.


Sure: stock Postgres doesn't have multi-master capability at all. Postgres-XC does (http://postgres-xc.sourceforge.net/), but I didn't know about that when I finally decided to move from a mix of MySQL and Postgres to MySQL only.


From the article:

> the replicas were out of sync with the master

Most replicated relational database environments are single-master environments; this mirrors the read-oriented traffic of many database use cases.

Most multi-master solutions require you to leave parts of SQL and ACID behind - the advantage is significantly better write scalability and write availability.

Mysql replication has been around for ages; it seems to blow up spectacularly on occasion. Postgres replication is relatively new, and I'm wondering if it also has similar issues. Anecdotal evidence suggests that it has not: I have never been awakened by panicked developers/operators because of postgres replication, but mysql replication has caused more than one sleepless night.


> Mysql replication has been around for ages; it seems to blow up spectacularly on occasion.

For us "old school" guys who have been programming for awhile, this is truly eyebrow-raising. Hearing that DB replication "seems to blow up spectacularly on occasion"...there's something wrong with this picture. A tool like DB replication should be at a much higher level of reliability than application code. It also confirms a lot of the grousing I've heard about the engineering hubris of MySQL over the last decade.

Also reminds me of Alan Kay's quip about what makes programming, "not quite a field."


I want to rush to MySQL's defense here, but I can't. If we forget for a moment about the history of it or the engineering challenges specific to MySQL and so on, what we have is an infrastructure application that one-way copies data to a remote instance of the same application, with almost no error handling and very few consistency checks, and that halts the data copy in the event of an error without notifying anybody that there's a problem.

MySQL replication feels like a hack, not the sort of thing that people can use as part of a reliable infrastructure.

My MySQL replication wishlist would be:

1. bidirectional communication protocol so that slaves can ask the master for a fresh copy of some particular data in the event of an error;

2. built-in notifications for anything that might make an alcoholic out of a sober sysadmin;

3. periodic idle-time consistency checking (master: "I have X tables with Z definitions and N rows each"; slave: "something is wrong with my copy of Y, I need rows 1 - 100").

I have a couple of projects in my pipeline right now that are being held up entirely by the feeling that MySQL is not yet reliable enough and I need to build better monitoring and automated error-handling systems first.


What you're describing is essentially any flavor of Galera Cluster for MySQL. Unfortunately, it comes with its own tradeoffs.

However, I do see something a bit... off... about your 2nd concern. Why would you want a database to handle its own monitoring? I'd personally rather just set up a set of MySQL monitoring plugins in Nagios. That way all of my monitoring is in one place.

Consistency checking across terabytes of data on multiple servers is a hard problem. It would be great if it could be solved, but I'm not holding my breath for them (well, any DB vendor for that matter) to get it right.


I've had issues with Nagios in the past and am not its biggest fan. Centralized monitoring is nice, but infrastructure-critical software shouldn't require people to add on monitoring applications IMO.


Do you have examples of other critical software incorporating their own monitoring solutions, such as Apache, PostgreSQL, even Linux?

How would you monitor memory, disk and load usage?

Sure - Nagios has its issues, but there are quite a few alternatives that can do the same thing, all of which can interface with your database for monitoring.


I suspect your statement could be templated:

MySQL <X> feels like a hack, not the sort of thing that people can use as part of a reliable infrastructure.


"old school" guys really shouldn't be raising eyebrows at MySQL blowing up in dumbass ways; they should be quietly congratulating themselves - yet again - on abandoning that inept piece of idiocy years ago.


Anecdotal data point: No problems. It's been rock solid for a couple of years now. Even though we have 22 databases on our master, the streaming replication overhead is practically zero, and the latency is usually on the order of milliseconds.

We did have a weird replication failure at one point when running pg_repack. It's a tool that reclusters tables by duplicating them (essentially writing a new table sequentially) and then renaming the duplicate back to the original table name. This caused the slave to suddenly try to access a table that had not been created on its end yet. But it's possible that it was due to misconfiguration; we are currently investigating it so we can collect enough data to maybe submit a bug report.

Postgres-XC looks interesting, but I have not heard of anyone using it in production yet. Would love to hear if anyone has used it for anything serious.


The article doesn't seem to suggest that multi-master was being used here, rather that many clones were being replicated from a single master - am I missing any reason to believe otherwise?


Nope. And indeed, maybe Postgres would work better than MySQL for Kickstarter. I don't know anything about their architecture.

I was just responding to the question about Postgres with an example where I found Postgres lacking.


I'm still trying to find time for a writeup of how we use MySQL Galera, but I'll take this chance to note again that any big MySQL houses out there should really take a look at it and see if it will work within your environment. Basically it's a true multi-master MySQL environment with shared-nothing (kind of like MySQL Cluster). We've run into our share of bugs, but we're perhaps more of an edge case. We create a TON of temporary tables and were encountering a bug which was eating up memory slowly. That has been fixed for a couple of releases now and everything has been groovy.

Basically though, the cluster takes care of ejecting bad slaves so you'll never need to worry about the replication status, etc.
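
For anyone curious, a rough sketch of the my.cnf settings a Galera node typically needs (the paths, names, and addresses here are examples, not our production config):

    cat >> /etc/mysql/conf.d/galera.cnf <<'EOF'
    [mysqld]
    binlog_format            = ROW         # Galera replicates row events only
    default_storage_engine   = InnoDB
    innodb_autoinc_lock_mode = 2           # interleaved auto-increment, required
    wsrep_provider           = /usr/lib/libgalera_smm.so
    wsrep_cluster_name       = my_cluster
    wsrep_cluster_address    = gcomm://10.0.0.1,10.0.0.2,10.0.0.3
    wsrep_sst_method         = xtrabackup
    EOF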

Oh, and feel free to contact me through my profile if you have any questions about it. We've had a good long run with it now and pretty much know most/all of the things to watch out for.


The biggest problem with Galera Cluster right now is that you can't have large, long-lived transactions. Due to the way that the non-write nodes verify data consistency, large transactions can cause some serious slowdown issues for the entire cluster.


True, and that would certainly qualify as an 'edge case' scenario, although if you increase the number of applier threads I believe it should mitigate this particular issue somewhat (more threads free to apply other transactions at the same time). Obviously, if you need to have dozens of long writes at the same time, this doesn't solve that.
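
Something along these lines, for example (the value is arbitrary):

    # More parallel appliers per node; can also be set as wsrep_slave_threads in my.cnf.
    mysql -e "SET GLOBAL wsrep_slave_threads = 16;"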


It would be really interesting to read your writeup on MySQL Galera usage.


Ugh. A very painful occurrence. A few recommendations, in the event that Kickstarter reads this thread:

- Run pt-table-checksum on a daily or weekly cron, and fix what it finds immediately.

- Switch off RDS.

  - Your db admins will thank you for taking off the kid gloves.

  - You can restore from binary backups quickly.

  - Performance will get better.

- You can use something like pt-slave-restart to get replication running quickly and re-sync the DBs after the immediate crisis is over (sketch below).

- If you don't already have it, hook up Nagios to RDS with some MySQL monitoring plugins like pmp (Percona Monitoring Plugins).

Data drift, and the resultant replication downtime, is an unfortunate reality of asynchronous replication. It doesn't have to be multiple hours' worth of scrambling, however, if you catch it early (with monitoring) and are prepared to handle it (with pt-table-sync or binary backups).
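
For the pt-slave-restart step, a sketch (the host and error number are examples; 1062 is a duplicate-key error):

    # Keep replication limping along on the broken replica by skipping only
    # duplicate-key errors until a proper pt-table-sync or rebuild can happen.
    pt-slave-restart --host replica1.example.com --error-numbers 1062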


Maybe this is a MySQL noob question, but why do they have to do this complicated thing to get a unique id for each project backer?

    CREATE TABLE project_backers (
      # The number they're looking for.
      id INTEGER PRIMARY KEY AUTO_INCREMENT,
      user_id INTEGER,  # user_table fk syntax here.
      project_id INTEGER  # project_table fk syntax here.
    )
Insert into this table every time someone backs a project and you get the monotonically ascending number for free.

Again, I'm a total SQL noob, so maybe this isn't really doing what they want.


I suspect they were trying to make the data look reasonably nice for project creators. Your (not incorrect) solution has the downside of ever-increasing numbers quickly reaching into the millions or billions, while their solution has numbers no larger than the number of backers on a project (unlikely to be more than tens of thousands).

They also save the storage cost of going to a BIGINT, but I doubt that was a major factor.


They mention transactions, so I would assume they are using InnoDB, in which case you have to have a primary key: if you don't define one, InnoDB will either use the first unique NOT NULL column it finds or, failing that, create one using the internal 6-byte row ID. They could have used a composite primary key on (project_id, backer_id), but that can create other problems, such as blowing up the size of your secondary indexes (the primary key is included in all secondary indexes in InnoDB). So I doubt they were too worried about the storage space compared to other options, and they probably already had a unique ID column for every row.

If I had to guess, they wanted this column so they could get the number of backers on any project very quickly. InnoDB's COUNT() performance has historically been pretty terrible, especially on large tables, so I'm guessing they were trying to avoid that, in addition to being able to provide the project owners with a 'backer ID' that is more human-digestible.

That said, the count could be handled with a trigger and there is a trick using LAST_INSERT_ID(expr) for generating the sequence:

Create a column somewhere to hold the sequence value, for example a backer_sequence_id column in the project table, then:

    UPDATE project SET backer_sequence_id=LAST_INSERT_ID(backer_sequence_id+1) WHERE id = ?;
    SELECT LAST_INSERT_ID();   -- returns the sequence ID that was just generated
It does lock the row in the project table if this is inside a larger transaction, but that shouldn't be too hard to avoid. LAST_INSERT_ID is handled per client connection, so it's multi-connection safe. The above works inside a trigger too, so the client code wouldn't need to think about it.
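
A rough sketch of the trigger variant (column names like backer_number are invented for illustration; it's piped through the mysql client so DELIMITER works):

    mysql mydb <<'EOF'
    DELIMITER //
    CREATE TRIGGER assign_backer_number
    BEFORE INSERT ON project_backers
    FOR EACH ROW
    BEGIN
        -- Bump the per-project counter; the row lock serializes concurrent backers.
        UPDATE project
           SET backer_sequence_id = LAST_INSERT_ID(backer_sequence_id + 1)
         WHERE id = NEW.project_id;
        -- LAST_INSERT_ID() is per-connection, so concurrent inserts don't collide.
        SET NEW.backer_number = LAST_INSERT_ID();
    END//
    DELIMITER ;
    EOF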


True, but couldn't you pretty easily derive the logical ID via some sort of pagination scheme? That is, only when you actually need to display it. It seems like this would be a pretty low-cost query as long as you set your indices up properly, and I'm assuming that for the hundred-thousand-backer projects you'll never need to display every backer in order.

Another slight concern is how big this project_backers table would get, since it's a combination. Let's assume every project has an average of 10k backers and there are about 100k projects. That makes 1B relationships, at roughly 128 bits per row (plus whatever associated data): about 16 GB. Not too bad for such an important relationship. Even with 100x growth and no garbage collection or archiving, you could fit it on (a couple of) flash disks for the foreseeable future. I don't know if MySQL supports delta encoding, but that would also probably make the table cheaper to store.


You could, but then you have multiple identifiers for the same object, and you end up having to decide "wait, which ID do I want here?", and during long nights when you're hopped up on five cans of Red Bull, the numbers start to run together, and you don't know if you're looking at a real ID or a logical ID, and even when you do know, you end up running queries directly against the database trying to convince yourself that the mapping can't possibly be correct.

Been there, done that, consumed a lifetime supply of Mountain Dew in a year, and handed in a multi-page resignation letter[1]. I'd rather use their solution, which ultimately minimizes the overall complexity of the system.

[1] OK, the resignation letter wasn't really about IDs, but the ID problems were a symptom of larger problems.


Interesting. I would expect that the logical id would be displayed only, never stored, but that would make debugging display issues difficult.

Thank you for your thoughts. It's helpful to hear the thoughts of other database professionals, especially in domains where I have no knowledge, like relational databases.


Autoincrement only works on one database, and is not guaranteed to be unique across multiple schemas.

Depending on their schema and sharding techniques, it's probably easier for them to create the unique ids in the application than rely on the DB.
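
For context, the usual DB-side workaround (when you do want the database to hand out IDs across multiple masters) looks like this sketch, with example values for a two-master setup:

    # Run on each master (or set in my.cnf): distinct offsets keep
    # auto-increment values from ever colliding between the two servers.
    mysql -e "SET GLOBAL auto_increment_increment = 2; SET GLOBAL auto_increment_offset = 1;"   # master A
    mysql -e "SET GLOBAL auto_increment_increment = 2; SET GLOBAL auto_increment_offset = 2;"   # master B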


Statement-based replication and anything involving it just looks like a disaster waiting to happen. There's a definite whiff of hubris here. There are just so many potential ways for something to go wrong.

I'd guess there was a big win in terms of efficiency for this to have been attempted. This is the sort of thing that you'd want formal methods or an environment like Haskell for. Either that, or a decade+ of engineering wizardry with relational systems and some nifty mathematical proofs in your arsenal.


These are the kinds of issues that I hate dealing with and that would make me consider hosting on a platform-as-a-service like Heroku. Does Heroku have auto-sharding IDs built in?


If you're using statement-based replication on Heroku, it still won't save you. I have no idea if you can or can't do so. Statement-based replication seems to be one of those things where you can shoot yourself in the foot if you don't design with it in mind.


While there are (or at least were) some third-party MySQL add-ons on Heroku, in general Heroku uses postgres instead. I have no idea how/if these issues apply to postgres.


Only in as far as people use statement-based replication on Heroku.


Heroku offers one database management system right now, Postgres, and its principal replication technology is not statement-based, largely for this reason. Using something like PGPool can make the problem come right back again.

So, I will say that the only credit Heroku's staff (of which I am a member, perhaps pertinent to this opinion) gets is for reviewing the technology in play and deciding something like "supporting statement-based replication is basically going to lure a lot of people into a place they don't want to be," and then ignoring statement-based replication entirely. The same general opinion can be said to have been held by PostgreSQL.org, I think. I also want to stress that statement-based replication is only one form of "logical" replication (as opposed to "physical", like the crash recovery logs), and there has been no shortage of interest in the in-the-works logical replication feature (which won't be seen until 9.4 at the earliest) based on crash-recovery log decoding. The design of that should, in principle, allow exactness, modulo bugs.

The decision to ignore statement-based replication has not been without cost: many simple workloads can work with statement-based replication, and it would have enabled support for on-line upgrades to new database versions sooner (for those simple workloads), but the bizarre (and silent) ways in which statement-based replication can break, and the seeming impossibility of making it predictable and exact, stayed our hand. Many people fail to read fine print or, sometimes, even big blinking warnings, so inevitably there would be a lot of aggravation for those affected and for the staff, who would have to toe a fine line between "told you not to do that..." and digging into a problem caused by unsupported use of a feature (probably on a case-by-case basis, for something like this) to help fix it so everyone can be somewhat happier. This is no fun for anyone involved, and when I can see it coming, I try to avoid that position. That said, not everything can be foreseen...but such is software.


> For those who love TLA's, here's some of our current stack: AWS, RoR, RDS, EC2, ES, DJ, SASS.

http://www.kickstarter.com/backing-and-hacking/welcome-to-ba...


This is a great postmortem about a truly unique MySQL issue.

However, this post left me wondering how they actually recovered from the issue. I'm thinking the only way is to re-provision the replica infrastructure from the master's presumably authoritative data. Or was there another recovery approach?


Because MySQL replication is simple, mature, and has a rich toolset, it is usually easy to recover from most replication problems. The Percona Toolkit (formerly Maatkit) has two tools that make this fairly painless: pt-table-checksum and pt-table-sync. The former checks for consistency using MySQL's built-in checksumming abilities, and the latter repairs the inconsistencies it finds.

Another useful tool is pt-slave-restart which will limp you along and skip replication errors automatically until you can repair the problem.


Wow, Kickstarter runs on Rails? I never knew...anyone have an idea of how many users they have / how many page views they serve? Just curious...


This is the Kickstarter stats page: http://www.kickstarter.com/help/stats

It doesn't have traffic numbers or insights into how they keep the site up and running with that much traffic and that many users, but it's still interesting to see everything about the projects there.


Thank you for the link :)



