I tried to explain to the reporter why Cassandra's data model (particularly that it supports an arbitrary number of columns per row) makes it support denormalization better than traditional rdbmses, and somehow that got turned into "cassandra supports an arbitrary number of rows."
This and a few other poor explanations make me wince reading this. So I don't particularly recommend this article, although I've seen worse. :)
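To make the columns-per-row point concrete, here is a toy Python sketch with plain dicts standing in for a column family. The names `timelines`, `post`, and `read_timeline` are made up for illustration and are not Cassandra's API; the only assumption carried over from the data model is that each row can hold its own, unbounded set of columns.

```python
from collections import defaultdict

# row key -> {column name -> value}; every row can have a different,
# arbitrarily large set of columns (a stand-in for a wide row)
timelines = defaultdict(dict)


def post(author, followers, post_id, text):
    """Denormalize on write: copy the post into each follower's timeline row."""
    for user in followers:
        timelines[user][post_id] = f"{author}: {text}"


def read_timeline(user):
    """Read one row, no join needed; columns are returned in sorted name order."""
    row = timelines[user]
    return [row[column] for column in sorted(row)]


post("alice", ["bob", "carol"], "0001", "hi")
post("dave", ["bob"], "0002", "yo")
print(read_timeline("bob"))
```

The trade is the usual one for denormalization: writes fan out to many rows so that a read touches exactly one.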
My understanding is that it came down to Cassandra being the only one that was well tested in multi-datacenter configurations. I think there were other reasons, but that was the big one.
Sites similar to Twitter are worried about reads/writes and uptime rather than concurrency and locking mechanisms. All flavors of NoSQL databases are best suited for this. Sites like amazon.com, or any transaction processing site, can't think of using NoSQL.
tl;dr: NoSQL is good until transaction processing and concurrency come into the picture.
Amazon pioneered Dynamo, a NoSQL-style scalable distributed key-value store, and built many of their site's internal services on it. It's backed by BDB and can be tuned to guarantee disk writes across multiple nodes if needed. See the paper below.
It certainly limits it from being useful if you don't have dollars to throw at high-end infrastructure, but it's not quite as terrible as it sounds. Effectively all it means is that you can't cold start it. How big a problem this is depends like anything else on the application in question.
As I write this comment I just know that in 5 years I will look at it and chuckle at myself but... it boggles my mind to consider a pure-RAM system that relies on at least one copy of the dataset being available forever in some system's "memory grid".
MongoDB does implicit transactions around single queries/commands on single documents, that can be complex enough to be used to create always-recoverable or always-resumable protocols.
As Jonathan has pointed out, the article has lots of inaccuracies, but it also has a very good explanation of the NoSQL Faustian pact: "Cassandra doesn't do joins... doesn't guarantee referential integrity, where the user knows the data being used reflects the latest updates... can't process transactions, with a guarantee that the transaction will either be completed or discarded, the way relational systems do" because it focuses on "more immediate goals than the pristine data handling rules of relational systems."
As the expression goes: at the poker table, if you don't know who the sucker is, it's you. Guess who's responsible for those fuddy-duddy, old-fashioned things like querying your data, or making sure that you've written data, or that your database isn't corrupted... it's you.
If you're Twitter, and you're fail-whaling every day, then maybe the work required to make this trade-off work makes sense. But I can't help but feel there's got to be a better way.
I won't defend Cassandra's design choices here, but I will say that the relational model is only about joins and normal forms. The other things you mention happen to be built into all modern relational databases but are orthogonal concepts.
A distributed hash table can be built with transactions, isolation, and so forth. Such a system would offer a different set of trade-offs that might satisfy a different set of users.
Completely agree with you on the theoretical level; it's a (lazy) shorthand to contrast the typical NoSQL trade-offs with the ACID model that most relational databases employ.
In practice though, I think if you introduce (multi-'row') ACID into any of today's NoSQL databases, you'd just end up with a bad traditional database (a 'relational' database without a strong theoretical grounding, without the ability to do joins and without a powerful query language). This whole NoSQL movement feels like a re-run of the evolution of the relational database - those that don't know their history are doomed to repeat it.
The key idea is that "a per-entity-group transaction log is used. One of the rows that stores the entity group is the entity group’s root. The log is stored with the root, which is replicated like all rows in Big Table."
This is basically a version of the approach advocated by Pat Helland in his paper on "Life Beyond Distributed Transactions" -- http://www.cidrdb.org/cidr2007/papers/cidr07p15.pdf -- namely, that the most sane approach to distributed transactions is to redefine the problem, and restrict transactions to be within "entities" that fit on a single machine. Which, as App Engine demonstrates, turns out to be enough to do an awful lot.
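A minimal sketch of the idea, assuming an in-memory toy where each entity group carries its own log and a commit succeeds only if nobody else has appended to that log since the transaction started. `EntityGroup`, `version`, and `commit` are hypothetical names for illustration; this is not App Engine's or Bigtable's actual implementation.

```python
class EntityGroup:
    """Toy entity group: its rows plus a transaction log kept with the group."""

    def __init__(self):
        self.rows = {}
        self.log = []  # one entry per committed transaction

    def version(self):
        """Current log position; a transaction records this when it starts."""
        return len(self.log)

    def commit(self, writes, read_version):
        """Apply writes atomically if the log has not advanced since read_version."""
        if read_version != len(self.log):
            return False  # another transaction committed first; caller retries
        self.log.append(dict(writes))
        self.rows.update(writes)
        return True


group = EntityGroup()
start = group.version()
print(group.commit({"balance": 10}, start))  # first writer wins
print(group.commit({"balance": 99}, start))  # stale version is rejected
```

Because the log and the rows live in one group on one machine, no coordination across machines is needed to decide whether a commit wins.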
Interesting links - thanks. There's definitely a continuum here - the full ACID model at one end; key-value stores at the other. It looks like Google is moving in the right direction - adding more ACID - which certainly plays to my personal bias :-)
> the most sane approach to distributed transactions is to redefine the problem, and restrict transactions to be within "entities" that fit on a single machine.
How is disallowing distributed transactions a sane approach? There's no harm in allowing distributed transactions using 2PC and letting the developer decide if he wants to incur the performance penalty or wants to colocate the data-structures to the same node to avoid 2PC overhead.
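For reference, the 2PC flow being discussed looks roughly like this. It is a single-process toy, assuming participants that never crash and a coordinator that never fails, which is exactly the part real 2PC has to pay for; `Participant` and `two_phase_commit` are illustrative names, not any library's API.

```python
class Participant:
    """Toy resource manager that stages writes until told to commit or abort."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.data = {}
        self.staged = None

    def prepare(self, writes):
        """Phase 1: vote yes and stage the writes, or vote no."""
        if not self.can_commit:
            return False
        self.staged = dict(writes)
        return True

    def commit(self):
        """Phase 2 (all voted yes): make the staged writes visible."""
        if self.staged is not None:
            self.data.update(self.staged)
        self.staged = None

    def abort(self):
        """Phase 2 (someone voted no): discard the staged writes."""
        self.staged = None


def two_phase_commit(work):
    """work is a list of (participant, writes); returns True if committed."""
    prepared = []
    for participant, writes in work:
        if participant.prepare(writes):
            prepared.append(participant)
        else:
            for p in prepared:
                p.abort()
            return False
    for p in prepared:
        p.commit()
    return True
```

Every transaction pays at least two round trips to every participant, which is the overhead that colocating the data on one node avoids.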
If there is an axis of "commonness" upon which things can be measured, with "never happens" on the left side, and "always happens" on the right side, then "uncommon" would be a small section on the left, and "common" would be a small section on the right; everything between these sections could be "not uncommon."
I'm a fan of the NoSQL movement and have been exploring Cassandra as an option for data storage.
I had a conversation with an engineer who works at a pretty well-known company here in SV and their sys admins are dropping Cassandra and pushing all the engineers back to MySQL. I don't know the whole story but it seemed to be implied that open-sourced Cassandra had issues and supposedly Facebook had a much different version they were using.
Of course this is all second hand, so I tried to search on the experiences of other people using Cassandra (with decent volume). Unfortunately most of the threads I found had people just like me, at the exploratory stage. Or they hadn't been live with it for long.
If there were any pitfalls or hairy parts with maintaining Cassandra, that would be good to know. Also, examples of clients who have decent load and have been using it for a while.
> their sys admins are dropping Cassandra and pushing all the engineers back to MySQL
I'm curious what you are thinking of, because I have a better picture of companies using Cassandra than most. :)
I do know of one company that fits your description, where some of the mysql DBAs were very anti-cassandra because, frankly, it's not mysql and that's what they were used to. But that has been resolved (the most vocal DBA left) and the Cassandra migration is continuing.
> If there were any pitfalls or hairy parts with maintaining Cassandra, that would be good to know
I'm curious why a lot of people / companies seem to be picking up Cassandra lately. I'm not one to peg one "NoSQL" software system against another, but I've been using HBase for a few months (albeit on a rather limited cluster) and feel that it fits great especially with its compatibility with Hadoop. We use Hadoop + Map/Reduce extensively at Grooveshark. Does anyone have experience in using both systems and can offer a candid account of both?
> So what kind of performance can you get from Cassandra?
~10k ops/second per quad-core node. Scales roughly linearly w/ node and core counts. ("Roughly" means, obviously there is network overhead as you move from single node to multiple, that kind of thing.)
> How big can the values be?
2 GB, although you probably don't want to max that out.
Cassandra is mostly used for smaller pieces of data, although I do know at least one person using it as an S3 replacement by chunking files into 64MB chunks; each file is one row consisting of columns that each contain one such chunk.
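The chunking scheme can be sketched in a few lines of Python. A plain dict stands in for the row here, and the column naming (`chunk-00000000`, ...) is an assumption for illustration; the comment above does not say how that person actually names columns.

```python
from typing import Dict

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per column, as described above


def file_to_row(data: bytes, chunk_size: int = CHUNK_SIZE) -> Dict[str, bytes]:
    """Split a blob into fixed-size chunks, one column per chunk.

    Column names are zero-padded so that sorting by name gives chunk order.
    """
    return {
        f"chunk-{offset // chunk_size:08d}": data[offset:offset + chunk_size]
        for offset in range(0, len(data), chunk_size)
    }


def row_to_file(row: Dict[str, bytes]) -> bytes:
    """Reassemble the blob by concatenating columns in sorted name order."""
    return b"".join(row[name] for name in sorted(row))


# Tiny chunk size so the example is easy to follow
row = file_to_row(b"abcdefghij", chunk_size=4)
print(sorted(row))
print(row_to_file(row))
```

In a real store each column would be read or written separately, so a client never has to hold the whole file in memory at once.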
> I was thinking of using it as a backend for a mail system.
Whoever does a thorough comparison of a number of these new "nosql" systems, including features and some benchmarks, will have him or herself an extremely popular article.
Benchmarking systems w/ very different data models is difficult to impossible, which is why you don't see that in this kind of survey piece. You're best off picking one segment and focusing on that. Yahoo did that with the ColumnFamily stores (cassandra, hbase, and one they wrote internally) here: http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf (note that Cassandra 0.5 results are on page 16 and 17, not inline w/ the rest)
Not everyone is plugged into the "scene" RSSes. A publication's purpose is to discover trends in news signals, and distill that into a coherent format supported by argument.
TFA does at least link the original interview w/ Ryan King of Twitter (http://nosql.mypopescu.com/post/407159447/cassandra-twitter-...) which is much better for people at HN level.
My own article at http://www.rackspacecloud.com/blog/2010/02/25/should-you-swi..., although high level, also has some useful links for those who want to drill down for more details.