Why Key-Value stores are like C, and why you might want to use one anyways

fauigerzigerk · on Jan 4, 2010

I think you're spot on when you talk about a tradeoff between performance and productivity. But what I would take issue with (just a tiny bit) is that your test scenario doesn't really show the extent of lost productivity.

Your test case is ideally suited to what key/value stores are good at. Once you need joins, the whole picture is going to look very different. Thankfully you do acknowledge that in your post.

There is another issue as well. You can get good performance from key/value stores only if you know upfront what kinds of queries you're going to need. For ad hoc queries you need a query optimizer that takes statistics into account in order to make use of indexes in the best possible way. RDBMSs have very sophisticated query optimizers.
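
To make the "know your queries upfront" point concrete, here is a minimal sketch in Python, using plain dicts as a stand-in for a key/value store. All names (`users`, `users_by_city`, `put_user`, and so on) are made up for illustration and do not come from the article. The planned query gets a hand-maintained index; the unplanned one falls back to a full scan, which is the work a query optimizer would otherwise decide on for you.

```python
# Toy in-memory key/value store: primary data plus one hand-maintained index.
# Every name here is hypothetical; a real store would live on disk.
users = {}           # primary store: user_id -> record
users_by_city = {}   # secondary index: city -> set of user_ids


def put_user(user_id, record):
    """Insert or update a record and keep the city index in sync by hand."""
    old = users.get(user_id)
    if old is not None:
        ids = users_by_city[old["city"]]
        ids.discard(user_id)
        if not ids:
            del users_by_city[old["city"]]
    users[user_id] = record
    users_by_city.setdefault(record["city"], set()).add(user_id)


def find_by_city(city):
    """Planned query: a single index lookup."""
    return sorted(users_by_city.get(city, set()))


def find_by_min_age(min_age):
    """Unplanned query: no index exists, so every record is scanned."""
    return sorted(uid for uid, rec in users.items() if rec["age"] >= min_age)


put_user(1, {"name": "ann", "city": "Oslo", "age": 31})
put_user(2, {"name": "bob", "city": "Bergen", "age": 25})
put_user(1, {"name": "ann", "city": "Bergen", "age": 31})  # ann moves
```

Forgetting the index update in `put_user` would silently return stale results, which is the kind of bug a relational database's own index maintenance rules out.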

One other difficulty arises when you try to make this architecture work in a multi-process scenario. Berkeley DB does support that in principle, but it's very difficult to make transactions, shared caches and recovery work reliably with processes that are started by Apache modules.

jimbokun · on Jan 4, 2010

I like that you present the trade-offs involved in choosing a relational DB or a key value store. Key value stores can go faster because they are lower level, like C compared to higher level programming languages. But this also means they give you fewer features, and you need to write more lines of code to get the same functionality, and if you're not careful you are more likely to have bugs or data corruption.

Most of the other things I have read on this topic have taken a position of "Relational good, Key-value bad" or the opposite. Your take is much more useful, with some back of the envelope data to back it up.

newhouseb · on Jan 4, 2010

"Most of the other things I have read on this topic have taken a position of "Relational good, Key-value bad" or the opposite."

This is part of a rift between the people writing the posts and the people spending their time building Facebooks, Googles, etc. Almost everyone I've talked to who operates (not just plans to) at HUGE scale relies on a combination of key-value stores (most commonly memcached) and relational databases. That way you get all the speed benefits of something like BDB, without having to re-invent alternatives to SQL queries in [your language of choice] wrapping BDB.
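
A rough sketch of that combination, usually called cache-aside, might look like the following. It is an illustration under assumptions, not anyone's production setup: a plain dict stands in for memcached, an in-memory SQLite database stands in for the relational side, and the table, key format and function names are invented for the example.

```python
import sqlite3

# Relational side: the source of truth (in-memory SQLite as a stand-in).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users VALUES (1, 'ann')")
db.commit()

cache = {}                 # stand-in for a memcached client
stats = {"db_reads": 0}    # counts how often the database is actually hit


def get_user_name(user_id):
    """Read through the cache; fall back to SQL on a miss."""
    key = f"user:{user_id}:name"
    if key in cache:
        return cache[key]
    stats["db_reads"] += 1
    row = db.execute(
        "SELECT name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    name = row[0] if row else None
    if name is not None:
        cache[key] = name
    return name


def rename_user(user_id, new_name):
    """Write to the database, then invalidate the cached copy."""
    db.execute("UPDATE users SET name = ? WHERE id = ?", (new_name, user_id))
    db.commit()
    cache.pop(f"user:{user_id}:name", None)
```

Repeated reads of the same user touch the database once; anything that needs a join or an ad hoc filter still goes straight to SQL.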

uggedal · on Jan 4, 2010

While it's easier to achieve high performance and scalability with key-value stores, they also make it easier to create highly available applications. Replication, failover, and multi-master support are abundant in the popular key-value offerings. This was the reason I eventually decided to use Tokyo Tyrant for http://wasitup.com

papaf · on Jan 4, 2010

I was interested to see that the simple MySQL reads in the benchmark are slightly faster than the Berkeley DB ones. Has Oracle broken Berkeley DB that badly, or is MySQL particularly fast?

smanek · on Jan 4, 2010

I suspect there are a few issues at play. First, with BDB I have to deserialize the entire object to get the UID - while I tell MySQL I'm only interested in 1 column. Second, with the simple query I'm literally just traversing a b-tree - which is the exact same thing that MySQL does since I told it to build an index on the relevant column (except with BDB I do so in Lisp instead of raw C).
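
The first point can be sketched in Python. This is only an analogy to the setup described above: JSON stands in for whatever serialization the Lisp layer uses, a dict stands in for BDB, SQLite stands in for MySQL, and the record layout is invented. It makes no claim about actual timings.

```python
import json
import sqlite3

# Hypothetical record with one small field we want and one large field we don't.
record = {"uid": 42, "name": "ann", "bio": "x" * 1000}

# Key/value style: the store only sees an opaque blob per key.
kv = {b"user:ann": json.dumps(record).encode("utf-8")}


def uid_from_kv(key):
    """Decode the whole object, large bio included, just to read one field."""
    obj = json.loads(kv[key])
    return obj["uid"]


# Relational style: the database knows the columns and returns only one.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT PRIMARY KEY, uid INTEGER, bio TEXT)")
db.execute(
    "INSERT INTO users VALUES (?, ?, ?)",
    (record["name"], record["uid"], record["bio"]),
)


def uid_from_sql(name):
    """Ask for a single column; the bio is never handed to the client."""
    row = db.execute("SELECT uid FROM users WHERE name = ?", (name,)).fetchone()
    return row[0]
```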

gaius · on Jan 4, 2010

I think a more representative test would be BDB vs SQLite, since they are architecturally more similar (e.g. there's no SQLite "daemon", it's a C library).

kscaldef · on Jan 4, 2010

Reads in MySQL are quite fast.

smanek · on Jan 4, 2010

Hello!

Part 3 of a series of articles I've been writing about the technology behind http://postabon.com. This is about the persistence layer (Elephant) I'm using - but I tried to make it a little more general so it could help non-Lisp programmers.

billswift · on Jan 4, 2010

Forgive my ignorance, but I never programmed much and not at all in several years, so I have been reading Learning Perl as a refresher. In what way(s) is a Key-Value database different from a hash, or set of hashes? According to the book, a hash is a set of key-value pairs. I assume the database is normally several hashes with the same keys for finding values, and the software for associating the values from multiple hashes, but is there more to it?

lucifer · on Jan 4, 2010

The difference is persistence provided by a "store", not the semantics of the data space (which is a mapping from keys to values). Further, a hash is one of the alternative implementation strategies for a key-value space; your typical in-memory "database" will likely opt for hashes, while issues regarding disk access latencies favor tree structures.
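
A small Python sketch of the persistence half of that answer, assuming the standard-library `dbm.dumb` module as a stand-in for a real key-value store (the file name and keys are arbitrary): both objects map keys to values, but only one of them is still there after the program forgets it.

```python
import dbm.dumb
import os
import tempfile

# A throwaway location for the on-disk store.
path = os.path.join(tempfile.mkdtemp(), "demo_store")

# In-memory hash: same key -> value semantics, gone when the process exits.
in_memory = {"greeting": "hello"}

# Persistent store: same semantics, but written to disk.
with dbm.dumb.open(path, "c") as store:
    store[b"greeting"] = b"hello"

# Simulate a restart: drop the in-memory mapping, reopen the store read-only.
del in_memory
with dbm.dumb.open(path, "r") as store:
    recovered = store[b"greeting"]
```

After the simulated restart, `recovered` holds the stored bytes, while the dict is simply gone.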

billswift · on Jan 4, 2010

Thanks, that makes sense.

billswift · on Jan 4, 2010

I read a lot of the NoSQL posts that were on HN a few months back, but they didn't have anything but programming details which I couldn't really follow or were not much more than adverts and "wow, isn't this great" stuff. Anyone know of a source, preferably a book, but web will do, that describes the principles of the newer databases as opposed to relational ones?

bitdiddle · on Jan 4, 2010

Good post. Another aspect of these simple key-value stores is that the lack of schema is sort of like late binding for design. More work needs to be done in the client layer but the tradeoff is maximum flexibility in evolving the application. The O-R mapping problem disappears.
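
As a hedged illustration of the "late binding" idea, here is a Python sketch in which a dict stands in for the store and JSON for the serialization; the keys, fields and function names are all hypothetical. Records written by two versions of an application sit side by side with no migration, and the client layer does the extra work of reconciling them on read.

```python
import json

store = {}  # stand-in for a schema-less key-value store


def save(key, obj):
    """Store any object shape; the store itself enforces nothing."""
    store[key] = json.dumps(obj)


def load_user(key):
    """Client-side 'schema': fill in fields that older records lack."""
    obj = json.loads(store[key])
    obj.setdefault("email", None)
    return obj


# Version 1 of the app wrote users without an email field.
save("user:1", {"name": "ann"})
# Version 2 adds a field; no ALTER TABLE, no migration of old records.
save("user:2", {"name": "bob", "email": "bob@example.com"})
```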