Strategy: Drop Memcached, Add More MySQL Servers

mdasen · on Aug 10, 2008

I would tend to disagree with that strategy.

First, memcached is fast. Allocations are made in O(1) time (think: scales infinitely) and is non-blocking. Anytime you want to look up an object, you just call for it.

With sharding, the process is more difficult. First, you query for the shard of the object, then you query that shard. OK, not too bad. What happens when you want to get all the friends of Adam? Well, you query for Adam's shard, get the list of his friends, then query for each of those friend's shards, then query each of the friend's shards for their data? Ugly, slow, gross! OK, so you denormalize! Only problem with that is that you then have to do a lot more queries on writes (which are the hard part to begin with). Plus, what if you want to search for users named 'Bob'. Well, typically sharding involves a setup like table, pk, shard - to relate an item to which shard it's on. If you're looking up by the primary key, you're gold. Not so much on the other fields. Yeah, you can query each shard, have it send back results and combine them yourself - in fact you could even automate that process in a proxy - but it isn't the most wonderful solution.

That scenario in memcached works much nicer - mostly because memcached is basically a sharded database to begin with. You don't have to look up which memcached server an object is on. Just pass it the ids you want ('person:5', 'person:22', 'person:900') and it will grab them from the appropriate shards and send them back! The problem is that memcached doesn't keep multiple copies so it can't be used for persistence. No problem, MySQL will handle that!

What's really missing from the whole debate is that when you get to a certain point in your application, you just don't get to query like you used to. With memcached or sharding, you're going toward limiting yourself to keys - somewhat similar to Google's BigTable (a dumb store).

With AppEngine, Google took a huge step to making this more usable to programmers. By adding an index on each field (with the option for indicies on multi-field lookups), AppEngine allows you to have a sharded database without sacrificing some querying capability. So, the big question is, why isn't someone developing a system that works like that - spews your data all around, but keeps an index available for some query capability? In fact, AppEngine's datastore queries work shockingly similar to views in CouchDB - look at how scanning a range is done in each and you'll get an idea.

In the meantime, MySQL/memcached in combination allow me to take advantage of both sharding and querying with relative ease. So, I save Bob as a friend of Adam in MySQL, then I update the entries in memcached for each of them by querying MySQL for their records and friend_ids and put {'id': 12, 'name': 'Adam', 'friends': [1, 212, 78, 51]} and {'id': 212, 'name': 'Bob', 'friends': [12, 491, 51, 999]} into memcached. When I want Adam and his friends, I run two memcached calls - give me 12, give me (1, 212, 78, 51).

That isn't perfect, but it can be a lot easier than sharding. At some point, it becomes requisite to shard, but it's not fun so why not use memcached along the way? "Don't prematurely optimize" comes to mind.

newt0311 · on Aug 9, 2008

Better strategy: drop MySQL and move to postgres.

djist · on Aug 10, 2008

This seems to be the mantra of the PostgreSQL fanboys.

I use PostgreSQL. I have found, in my experience, that it has superior join performance when compared with MySQL and that's what is important to be in my line of work. That said, I'm not silly enough to think that performance = scalability. Neither MySQL or PostgreSQL scale naturally. They require strategies to be implemented whether that be sharding, caching, replication, etc.

This just isn't a useful suggestion for scaling.

kingkongrevenge · on Aug 10, 2008

There is so much ink spilled about going through conniptions to scale mysql and postgres. I really wonder if anyone has actually bothered to confirm that it's not better to just buy some big iron and run db2 or oracle once you hit scalability problems. There are a ton of corporate data centers running oracle/db2 handling WAY more throughput than your web app ever will, and they're not using memcached or sharding.

gaius · on Aug 10, 2008

You might be surprised about the number of database "experts" who have never heard of Postgres. They know about MySQL because Slashdot uses it, and they vaguely know about Access, tho' they've never used it and don't know what it's good for.

Sharding, for instance, is a hot topic right now but "big iron" DBAs like myself are completely mystified... Isn't this just the partitioning/shared nothing feature that we've had for over a decade? (Answer: yes of course it is).

DanielBMarkham · on Aug 10, 2008

It's a build-buy decision, right?

When you're seven guys sitting in a garage eating pizza, you build it because it makes the most sense.

If you've got a million records going through an hour and you're making the big bucks, get the commodity solution as close to wholesale as you can.

The problem is that if you're giving away your app for free, you're spending dough -- perhaps lots of it. Then you're stuck on the back-end of the curve, scrambling to tweak every little bit you can.

kingkongrevenge · on Aug 10, 2008

> It's a build-buy decision, right?

Exactly, but the entry level DB2 and Oracle solutions are not super cost prohibitive and time is money. We're talking about a pretty low man-hour investment in futzing with MySQL before you'd have been better off calling an IBM rep. If you're doing some popular free web app maybe you can stick a "powered by IBM" icon in the corner and get the stuff for free. Remember those?