Super interesting post. Would love to read more detail about their backup and restore infrastructure.
If Tom and/or Shlomi are reading this: you mention taking multiple logical backups per day. What benefit does this bring versus just having one per day and doing a point-in-time restore using binlogs? Is this just a tradeoff between time taken for a restore and storage you're willing to dedicate to backups?
@jivid the logical backups are done per-table, not per-server.
Per-table logical backups are useful to the engineers owning the data. It makes it easy for them to restore data from a single table.
When an engineer loads logical backup data, it goes into a non-production private zone where the engineer has access to the data and can make an informed decision on whether data changes need to be re-applied (due to a bug, a need to review historical data, etc.).
This of course has the advantage of quicker restores (we only need a single table), and it happens to cover the vast majority of cases. It doesn't, however, cover the case where we need to restore consistent data across two or more tables.
We use it because it works well for us. We've put a lot of work into making MySQL scale for us to the point where it's a very well supported system and one of the main choices for a lot of storage decisions.
We even use MySQL as a queue for Facebook Messenger. More details about this:
Is the FB branch version of mysqldump still single threaded? How do you cope with that?
Where I can't get mydumper deployed, I currently "fake it" by running multiple mysqldump processes, each using "START TRANSACTION WITH CONSISTENT SNAPSHOT".
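For illustration, here's a minimal Python sketch of that "fake it" approach (the database name, table names, and output paths are hypothetical). Each per-table mysqldump runs with `--single-transaction`, which issues START TRANSACTION WITH CONSISTENT SNAPSHOT under the hood. Note that independently started dumps take their snapshots at slightly different moments, so cross-table consistency is not guaranteed without extra coordination:

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor

def dump_command(db, table, outfile):
    # --single-transaction makes mysqldump open a consistent snapshot,
    # so each dump sees a consistent view without locking the table.
    return (
        f"mysqldump --single-transaction --skip-lock-tables "
        f"{shlex.quote(db)} {shlex.quote(table)} > {shlex.quote(outfile)}"
    )

def parallel_dump(db, tables, workers=4):
    # One mysqldump process per table, a few at a time.
    cmds = [dump_command(db, t, f"{db}.{t}.sql") for t in tables]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: subprocess.run(c, shell=True), cmds))
```

mydumper closes the consistency gap by briefly holding a global read lock while all of its worker threads open their snapshots, so every thread dumps from the same point in time.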
In addition to what evanelias said, a logical dump also means we can load it into a MySQL instance running a different storage engine as well. In our case, it allows us to take a mysqldump from an InnoDB instance and load it into a MyRocks instance if we wish.
Yes, logical backups are smaller due to lack of index overhead. And since logical backups are textual, they can also be used for other clever purposes, such as ETL pipelines.
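As a toy example of treating a logical dump as text for ETL, here's a deliberately naive Python sketch that pulls rows out of simple single-row INSERT statements. A real pipeline would use a proper SQL parser (NULLs, escape sequences, and multi-row VALUES lists are not handled here):

```python
import ast
import re

# Matches simple statements like: INSERT INTO `users` VALUES (1,'alice');
INSERT_RE = re.compile(r"INSERT INTO `?(\w+)`? VALUES\s*(\(.*?\));", re.S)

def rows_from_dump(sql_text):
    """Extract (table, row_tuple) pairs from simple mysqldump INSERT lines."""
    out = []
    for table, values in INSERT_RE.findall(sql_text):
        # ast.literal_eval handles numbers and quoted strings; it is only
        # a stand-in for real SQL value parsing.
        out.append((table, ast.literal_eval(values)))
    return out
```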
There are a few different ways to verify, some of which involve stopping replication, as you pointed out. These can sometimes be quite expensive, so the verification method can be tuned depending on the type of verification required.
> Do you diff against the same base, or create an incremental chain? How many diffs do you take in between recapturing a full image? At $DAYJOB we always take full backups into a fast in-house deduplicating store.
We always diff against the same base and leave 5 days between subsequent full dumps. The number of days is just a trade-off between the space occupied by the backups and the time it takes to generate them.
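Assuming the dumps are line-oriented text, diffing against a fixed base can be sketched with Python's difflib. The nice property of same-base diffs (versus an incremental chain) is that a restore needs only the base plus one diff, never a replay of intermediate diffs:

```python
import difflib

def incremental_backup(base_lines, today_lines):
    """Store only a unified diff of today's dump against the same full base."""
    return list(difflib.unified_diff(base_lines, today_lines,
                                     fromfile="base", tofile="today"))
```

Each day's diff grows as the data drifts from the base, which is one reason to recapture a full dump every few days.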
> Is there no better way to handling this than polling?
There are definitely different ways to approach this; we find polling works well for us. We also use the same database for crash recovery, so doing the assignments through it serves both purposes.
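A rough sketch of this claim-by-polling pattern, with SQLite standing in for MySQL and a hypothetical `jobs` table: a worker claims a pending job with a conditional UPDATE, so the same row that assigns the work also records durable state for crash recovery:

```python
import sqlite3
import time

def claim_job(conn, worker_id):
    """Atomically claim one pending job; returns its id, or None if idle."""
    row = conn.execute(
        "SELECT id FROM jobs WHERE state='pending' LIMIT 1").fetchone()
    if row is None:
        return None
    # The state='pending' guard makes the claim safe against racing workers.
    n = conn.execute(
        "UPDATE jobs SET state='running', worker=? WHERE id=? AND state='pending'",
        (worker_id, row[0])).rowcount
    conn.commit()
    return row[0] if n else None

def poll_for_work(conn, worker_id, interval=5.0):
    """Poll until a job can be claimed."""
    while True:
        job = claim_job(conn, worker_id)
        if job is not None:
            return job
        time.sleep(interval)
```

If a worker crashes, its jobs are left in the 'running' state under its id, so a recovery pass can find and re-queue them from the same table.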
> Presumably you can only get this parallelism by disabling FK integrity. Is it re-enabled in the following VERIFY stage?
I'm not sure what you mean by parallelism through disabling FK integrity. Splitting the backup into its tables means we can restore a subset of tables instead of the entire backup. This lets us load individual tables concurrently, and also means we don't have to wait for a massive database to load when all we need is a few small tables.
> I'm not sure what you mean by parallelism through disabling FK integrity.
Say you have a `user` table and a `post` table with `post.user_id` being a FOREIGN KEY on `user.user_id`. Without disabling FK integrity you would not be able to restore a post without restoring the user first. When restoring in parallel this might or might not work out.
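This can be demonstrated with SQLite standing in for MySQL (a MySQL restore would use SET FOREIGN_KEY_CHECKS=0 instead of the PRAGMA): with enforcement on, loading the child table first fails; with checks off, tables can load in any order and the parent row can arrive later:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (user_id INTEGER PRIMARY KEY);
CREATE TABLE post (post_id INTEGER PRIMARY KEY,
                   user_id INTEGER REFERENCES user(user_id));
""")

# With FK enforcement on, restoring posts before users fails.
conn.execute("PRAGMA foreign_keys = ON")
try:
    conn.execute("INSERT INTO post VALUES (1, 42)")
    ordered_load_required = False
except sqlite3.IntegrityError:
    ordered_load_required = True
    conn.rollback()

# With checks off (SET FOREIGN_KEY_CHECKS=0 in MySQL), load order
# no longer matters, so tables can be restored in parallel.
conn.execute("PRAGMA foreign_keys = OFF")
conn.execute("INSERT INTO post VALUES (1, 42)")   # child first: now fine
conn.execute("INSERT INTO user VALUES (42)")      # parent arrives later
conn.commit()
```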
Facebook (along with almost everyone else using MySQL at massive scale) doesn't use foreign keys.
They scale poorly in MySQL, and they lose a lot of purpose in a massively sharded environment anyway. For example, say you like a status post on Facebook, or friend another user. It's very unlikely that the liked status or friended user exists on the same shard as your account, and there's no way to enforce a foreign key relationship in an inherently non-distributed database like MySQL.
So instead integrity is handled at the application layer, with additional background processes to fix the occasional integrity problem and detect integrity anomalies.
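A background integrity pass of that kind can be sketched in a few lines of Python (the likes/statuses shapes here are hypothetical; in practice each side would be scanned from different shards):

```python
def find_orphans(likes, live_status_ids):
    """Background pass: find likes whose target status no longer exists.

    likes: iterable of (user_id, status_id) pairs.
    live_status_ids: set of status ids that still exist.
    Because the two sides live on different shards, this runs as an
    offline scan rather than as a database-enforced constraint.
    """
    return [(u, s) for (u, s) in likes if s not in live_status_ids]
```

A fixer job would then delete or flag the returned pairs, which is the application-layer equivalent of ON DELETE CASCADE.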
I understood it to mean that if you restore table A and table B in parallel, if there is a foreign key between them, then referential integrity checks would cause one of the loading operations to fail. How do you deal with that?
I would like to know how much of that 21% is scripting/shell/utility code, how much is services, and how much is truly part of the real products.
Labeling scripting/shell/utility code as "backend" might give the wrong idea to certain crowds, who might have thought that backend work matters more than the front-end, product-sensitive code that a billion users interact with every day.
Disclaimer: I work on Facebook's MySQL backup and restore system (https://code.facebook.com/posts/1007323976059780/continuous-...)