Super interesting post. Would love to read more detail about their backup and restore infrastructure.
If Tom and/or Shlomi are reading this: you mention taking multiple logical backups per day. What benefit does this bring versus just having one per day and doing a point-in-time restore using binlogs? Is this just a tradeoff between time taken for a restore and storage you're willing to dedicate to backups?
@jivid the logical backups are done per-table, not per-server.
Per-table logical backups are useful to the engineers owning the data. It makes it easy for them to restore data from a single table.
When an engineer loads logical backup data, it goes into a non-production private zone where the engineer has access to the data and can make an informed decision on whether data changes need to be re-applied (due to a bug, a need to review historical data, etc.).
This of course has the advantage of quicker restores (we only need a single table), and it happens to cover the vast majority of cases. It doesn't, however, cover the case where we need to restore consistent data across two or more tables.
We use it because it works well for us. We've put a lot of work into making MySQL scale for us to the point where it's a very well supported system and one of the main choices for a lot of storage decisions.
We even use MySQL as a queue for Facebook Messenger. More details about this:
Is the FB branch version of mysqldump still single threaded? How do you cope with that?
Where I can't get mydumper deployed, I currently "fake it" by running multiple mysqldump processes, each using "START TRANSACTION WITH CONSISTENT SNAPSHOT".
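For illustration, here's a minimal Python sketch of that "fake it" approach (the database name, table names, and output paths are hypothetical). Each per-table mysqldump runs with `--single-transaction`, which issues START TRANSACTION WITH CONSISTENT SNAPSHOT under the hood. Note that independently started dumps take their snapshots at slightly different moments, so cross-table consistency is not guaranteed without extra coordination:

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor

def dump_command(db, table, outfile):
    # --single-transaction makes mysqldump open a consistent snapshot,
    # so each dump sees a consistent view without locking the table.
    return (
        f"mysqldump --single-transaction --skip-lock-tables "
        f"{shlex.quote(db)} {shlex.quote(table)} > {shlex.quote(outfile)}"
    )

def parallel_dump(db, tables, workers=4):
    # One mysqldump process per table, a few at a time.
    cmds = [dump_command(db, t, f"{db}.{t}.sql") for t in tables]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: subprocess.run(c, shell=True), cmds))
```

mydumper closes the consistency gap by briefly holding a global read lock while all of its worker threads open their snapshots, so every thread dumps from the same point in time.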
In addition to what evanelias said, a logical dump also means we can load it into a MySQL instance running a different storage engine as well. In our case, it allows us to take a mysqldump from an InnoDB instance and load it into a MyRocks instance if we wish.
Yes, logical backups are smaller due to lack of index overhead. And since logical backups are textual, they can also be used for other clever purposes, such as ETL pipelines.
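As a toy example of treating a logical dump as text for ETL, here's a deliberately naive Python sketch that pulls rows out of simple single-row INSERT statements. A real pipeline would use a proper SQL parser (NULLs, escape sequences, and multi-row VALUES lists are not handled here):

```python
import ast
import re

# Matches simple statements like: INSERT INTO `users` VALUES (1,'alice');
INSERT_RE = re.compile(r"INSERT INTO `?(\w+)`? VALUES\s*(\(.*?\));", re.S)

def rows_from_dump(sql_text):
    """Extract (table, row_tuple) pairs from simple mysqldump INSERT lines."""
    out = []
    for table, values in INSERT_RE.findall(sql_text):
        # ast.literal_eval handles numbers and quoted strings; it is only
        # a stand-in for real SQL value parsing.
        out.append((table, ast.literal_eval(values)))
    return out
```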
There are a few different ways to verify, some of which involve stopping replication, as you pointed out. These can sometimes be quite expensive, so the verification method can be tuned depending on the type of verification required.
> Do you diff against the same base, or create an incremental chain? How many diffs do you take in between recapturing a full image? At $DAYJOB we always take full backups into a fast in-house deduplicating store.
We always diff against the same base and leave 5 days between subsequent full dumps. The number of days is just a trade-off between the space occupied by the backups and the time it takes to generate them.
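Assuming the dumps are line-oriented text, diffing against a fixed base can be sketched with Python's difflib. The nice property of same-base diffs (versus an incremental chain) is that a restore needs only the base plus one diff, never a replay of intermediate diffs:

```python
import difflib

def incremental_backup(base_lines, today_lines):
    """Store only a unified diff of today's dump against the same full base."""
    return list(difflib.unified_diff(base_lines, today_lines,
                                     fromfile="base", tofile="today"))
```

Each day's diff grows as the data drifts from the base, which is one reason to recapture a full dump every few days.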
> Is there no better way to handling this than polling?
There are definitely different ways to approach this; we find polling works well for us. We also use the same database for crash recovery, so doing the assignments through it serves both purposes.
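A rough sketch of this claim-by-polling pattern, with SQLite standing in for MySQL and a hypothetical `jobs` table: a worker claims a pending job with a conditional UPDATE, so the same row that assigns the work also records durable state for crash recovery:

```python
import sqlite3
import time

def claim_job(conn, worker_id):
    """Atomically claim one pending job; returns its id, or None if idle."""
    row = conn.execute(
        "SELECT id FROM jobs WHERE state='pending' LIMIT 1").fetchone()
    if row is None:
        return None
    # The state='pending' guard makes the claim safe against racing workers.
    n = conn.execute(
        "UPDATE jobs SET state='running', worker=? WHERE id=? AND state='pending'",
        (worker_id, row[0])).rowcount
    conn.commit()
    return row[0] if n else None

def poll_for_work(conn, worker_id, interval=5.0):
    """Poll until a job can be claimed."""
    while True:
        job = claim_job(conn, worker_id)
        if job is not None:
            return job
        time.sleep(interval)
```

If a worker crashes, its jobs are left in the 'running' state under its id, so a recovery pass can find and re-queue them from the same table.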
> Presumably you can only get this parallelism by disabling FK integrity. Is it re-enabled in the following VERIFY stage?
I'm not sure what you mean by parallelism through disabling FK integrity. Splitting the backup into its tables means we can restore a subset of tables instead of the entire backup. This lets us load individual tables concurrently, and also means we don't have to wait for a massive database to load when all we need is a few small tables.
> I'm not sure what you mean by parallelism through disabling FK integrity.
Say you have a `user` table and a `post` table with `post.user_id` being a FOREIGN KEY on `user.user_id`. Without disabling FK integrity you would not be able to restore a post without restoring the user first. When restoring in parallel this might or might not work out.
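This can be demonstrated with SQLite standing in for MySQL (a MySQL restore would use SET FOREIGN_KEY_CHECKS=0 instead of the PRAGMA): with enforcement on, loading the child table first fails; with checks off, tables can load in any order and the parent row can arrive later:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (user_id INTEGER PRIMARY KEY);
CREATE TABLE post (post_id INTEGER PRIMARY KEY,
                   user_id INTEGER REFERENCES user(user_id));
""")

# With FK enforcement on, restoring posts before users fails.
conn.execute("PRAGMA foreign_keys = ON")
try:
    conn.execute("INSERT INTO post VALUES (1, 42)")
    ordered_load_required = False
except sqlite3.IntegrityError:
    ordered_load_required = True
    conn.rollback()

# With checks off (SET FOREIGN_KEY_CHECKS=0 in MySQL), load order
# no longer matters, so tables can be restored in parallel.
conn.execute("PRAGMA foreign_keys = OFF")
conn.execute("INSERT INTO post VALUES (1, 42)")   # child first: now fine
conn.execute("INSERT INTO user VALUES (42)")      # parent arrives later
conn.commit()
```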
Facebook (along with almost everyone else using MySQL at massive scale) doesn't use foreign keys.
They scale poorly in MySQL, and they lose a lot of purpose in a massively sharded environment anyway. For example, say you like a status post on Facebook, or friend another user. It's very unlikely that the liked status or friended user exists on the same shard as your account, and there's no way to enforce a foreign key relationship in an inherently non-distributed database like MySQL.
So instead integrity is handled at the application layer, with additional background processes to fix the occasional integrity problem and detect integrity anomalies.
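A background integrity pass of that kind can be sketched in a few lines of Python (the likes/statuses shapes here are hypothetical; in practice each side would be scanned from different shards):

```python
def find_orphans(likes, live_status_ids):
    """Background pass: find likes whose target status no longer exists.

    likes: iterable of (user_id, status_id) pairs.
    live_status_ids: set of status ids that still exist.
    Because the two sides live on different shards, this runs as an
    offline scan rather than as a database-enforced constraint.
    """
    return [(u, s) for (u, s) in likes if s not in live_status_ids]
```

A fixer job would then delete or flag the returned pairs, which is the application-layer equivalent of ON DELETE CASCADE.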
I understood it to mean that if you restore table A and table B in parallel, if there is a foreign key between them, then referential integrity checks would cause one of the loading operations to fail. How do you deal with that?
I would like to know how much of that 21% is scripting/shell/utility code, how much is services, and how much is truly part of the real products.
Labeling scripting/shell/utility code as "backend" might give the wrong idea to certain crowds, who might have thought that backend work matters more than the front-end, product-sensitive code that a billion users interact with every day.
Disclaimer: I work on Facebook's MySQL backup and restore system (https://code.facebook.com/posts/1007323976059780/continuous-...)