Mongodb IS web scale: hadoop-mongodb

jbellis · on Dec 22, 2011

My comment from TFA. Context: Paul Querna correctly pointed out that Cassandra has had Hadoop integration since 0.6, and the author replied that Cassandra was complicated.

If your objection to Cassandra is "it's complicated," you have no business running Hadoop. :) How to set up a Cassandra cluster in under two minutes: http://www.screenr.com/5G6

If on the other hand you simply made statements like "Mongo is the first NoSQL to nail painless Hadoop and Pig integration" without doing any research, then you should probably edit your blog post.

rjurney · on Dec 22, 2011

I don't have anything against Cassandra. I will take a look at your link, and see how easy it is to integrate with Pig. I would be pretty excited to have another painless option available. The fact that Cassandra works with Whirr is very, very cool. However:

Cassandra's documentation: http://wiki.apache.org/cassandra/GettingStarted

"Cassandra is an advanced topic, and while work is always underway to make things easier, it can still be daunting to get up and running for the first time."

Change your docs, or demonstrate how to one-liner push data to Cassandra, and I will happily update my post. Shadow puppet docs do not count.

Your statement about Hadoop being complex illustrates EXACTLY the problem I'm trying to solve. 'Big data' usability. ;) Amazon EMR against records in S3 with Pig is not hard. Publishing data from S3 via EMR to Mongo in Heroku... that is not hard either. Wow, suddenly 'big data' is open to anyone using Heroku. That is a big deal.

cbsmith · on Dec 22, 2011

Gee. I kind of thought HBase & Cassandra had Hadoop integration pretty much down...

squarecog · on Dec 22, 2011

The HBase integration with Pig is pretty good (disclaimer: I wrote a bunch of it, and use it on a daily basis). The only thing is that you need to create the table and set up column families yourself. The mongo driver Russel demoes automatically creates a table which may or may not be a good thing. Also, he didn't actually say anything about scalability except for linkbaiting in his title :).

rjurney · on Dec 22, 2011

HBase integration is good. But having to deal with column families, etc. rule it out for me in terms of solving the usability problem. I just want to push records and retrieve them as JSON. This is the most common use case when publishing data from Hadoop to a NoSQL store. I think this could be fixed? Can column families be inferred? I am highlighting Mongo's superior usability here to set an example for others.

squarecog · on Dec 22, 2011

I would argue that any time you put "just" and "terabytes" next to each other, you are heading for big problems to go with your big insights :). Schema-less is great.. until you can't find stuff and your data is full of inconsistencies.

rjurney · on Dec 22, 2011

I've operated this way in practice, at scale, and it works fine. You're rebuilding your entire store and swapping it out frequently, so data consistency isn't a problem. The key is to have a painless pipeline setup, so that one person can do the entire thing... thus negating the need for contracts between parts of the stack.

rayglover · on Dec 22, 2011

You might want to take a look at Lily; a document store that runs on HBase: [http://www.lilyproject.org/lily/index.html]. It exposes a REST interface to perform CRUD and search operations, and has a very expressive object model (including record-to-record links).

rjurney · on Dec 22, 2011

Lily looks cool, thanks.

kevinpet · on Dec 23, 2011

Sure it is. I'll grant you that Mongo DB is web scale. Now could someone tell me what web scale means? The whole point of that xtranormal piece was the "web scale" is a meaningless marketing term. You can't argue that something qualifies for a meaningless marketing term.

cliff notes: Article didn't define web scale, therefore I didn't read meaningless article.

gsteph22 · on Dec 22, 2011

rjurney · on Dec 22, 2011

Hey, whatever you think about Mongo... it can probably work well as a read-only key/value store. Which is all that is needed to publish data from Hadoop, because you've already batch processed it into presentation form.