Goodbye, MongoDB (zopyx.de)
144 points by zedr on May 16, 2012 | 122 comments



How tiring. We can litter the internet with posts like this, but it would be a lot more useful to write reasoned, factual and detailed posts when discussing the merits or pitfalls of a given technology.

I've launched very high traffic websites using MongoDB, where it was the least of my worries. I've also launched very high traffic websites using MySQL, where it was my primary source of pain.

Shall I now run around screaming loudly about how badly MySQL sucks? On the surface you might assume so, but with reasonable effort to consider all the facts, it becomes clear that there's more to this than just "this database is better than that one because I hate it."

This post is more of an angry payback rant from someone who got banned from IRC for being abusive to newcomers and greenies. On top of that, it is factually incorrect in some spots, and misleading in others. Not impressed.

That said, NoSQL databases are evolving extremely quickly, and have reached a point of maturity that they can excel in the right scenario. Just like relational databases, they can provide big benefits, given that the user takes the time to learn how to use them, as well as making sure they have a use case that merits the strengths of their chosen tech.

I really enjoy working with MongoDB and Redis, and have been a community cheerleader of sorts for PostgreSQL for years. It is possible to have multiple tools in your toolbox, and use them when appropriate. No tool is perfect, and no tool is perfect for every job.

Complaining bitterly that your shiny new piano makes a lousy boat, on the other hand, simply adds to the noise.


would love to hear specifically what was factually incorrect


Here's one I noticed: I have a small app running on 2.0.1 and there are _two_ prealloc.X files in the journal directory, each one is 256mb. Not quite 3GB unless my math is wrong. (There's a third file in the directory named `j._0` which is also 256mb)


This is what I don't understand.

For me MongoDB and PostgreSQL aren't really playing in the same field. Yes they are both databases. But I wouldn't use PostgreSQL for document storage (the recent JSON addition is pretty weak). Likewise I wouldn't use MongoDB when I need my data to be more structured. I would have no issue using both at the same time.

I guess everyone is looking for that mythical single stack of technologies they can use for every use case.


It seems like ranters against MongoDB often don't really understand how it works and thus what it is good for.

My simple mental model for MongoDB is: indices (should!) fit in memory and documents are stored contiguously on disk. That is it in a nutshell. A query involves in-memory lookups and maybe only one disk seek. Writes in place are usually possible.

I have been happiest with MongoDB in two different scenarios:

The first is in developing small web applications where there is no scaling issue, and the fact is that MongoDB is so easy to develop against and simply provides a great developer experience. As needed, I use a cron job to do a mongodump a few times a day; or, if I really need high availability (which, frankly, often I don't: if a system is unavailable once or twice a year for an hour it is no big deal) then replica sets are OK.

The second scenario where I have really liked using MongoDB was doing analytics on a modestly large stream of social media data. A single Mongo master on a large EC2 instance was adequate to handle writes and slaves on other large EC2 instances each fed a different analytics application. This setup of apps reading from a slave on the same server worked really well for me. This was a low hassle experience.

I do have one customer with really large MongoDB setups across multiple data centers, and I am working through some hassles right now, but we haven't found anything else as cost effective for the customer's applications.

All that said, when I can use it, just using a single (no horizontal scaling) PostgreSQL server is for me the most hassle free developer experience, but I have always used PostgreSQL for small or medium sized applications - nothing that needed to scale.


You mention two different scenarios that use MongoDB effectively, but neither is what MongoDB purports to be used for.

The first is a small setup. MongoDB is called that because it's supposed to be for "humongous" data sets. Also, how is 3GB of space preallocated for journalling, which is one of the points in the article, good for a small installation?

The second is a "modestly" sized stream of social media data. Having myself worked with a much larger stream of social media data in Mongo, I can attest that the second you leave the land of a single Mongo server, you have a much bigger problem: sharding. Sharding is terrible in Mongo. Writes lock the whole shard, not the collection; your shard keys are immutable (imagine having an indexed field you can only set ONCE); and it is fraught with data loss. Did you have a drop in network connectivity between your mongos and your config server? You just silently lost data. Safe mode being on doesn't matter: if the config server doesn't get the write, it isn't able to report where the data is for a read, even if it was confirmed written to disk.

To your point about mongo ranters not understanding what it's good for, MongoDB tells everyone it's good for "big" and "fast" data. However, it fails at both of these, because it doesn't easily scale up from one node, and it doesn't write quickly when you actually want to make sure that your data is there. What's the point of writing data quickly to something that loses it quietly? Might as well pipe it to /dev/null


I feel like 1 - 2 years ago I was reading a slew of blog posts with the title "Why We Chose MongoDB." Now it seems like all of the blog posts are some sort of "We Just Finished Migrating off of MongoDB, Here's Why."

I know nothing about MongoDB and have never tried it. But the message seems pretty clear.


The message is, just like every technology, there's an initial period where a vocal minority loves it and tries to use it for everything. Then, there's a backlash where a vocal minority hates it and thinks anyone who uses it is clearly an idiot. All the while, the silent majority go on getting work done. It's been this way for as long as I can remember.


I fully agree here. A couple of years ago, when I went on interviews at startups, they were all using nosql db's and were proud of it. More recently, I've been interviewing at startups that are now bigger, need to mine the data that they've collected over the last few years, and now are migrating off of nosql db's to rdbms' (or creating strange amalgams of the 2). I did see this coming, but it was very hard to make them understand back when it was the coolest thing.


The reality is that developers are maturing in their understanding of different technologies and they are learning how to apply them in correct use cases.

There continues to remain a "golden hammer" syndrome where white horses and unicorns run free, but it doesn't exist.

Instead, the vision of "NoSQL" was to tell developers that they did not have to use relational data for everything, but could, instead, use the right tool at the right time. Why is this such a hard concept?

If you are a developer and you don't understand the tool you are wielding (it's pretty clear the author of this blog didn't), then you will incorrectly use the tool and experience pain.

That is the fault of the developer, not the tool.


Care to specify at which point the author of the blog post misunderstands MongoDB? It's not clear to me.

Are you saying that one of the ways he is using MongoDB is incorrect? If so, which of his points reflects that?

His rants seem quite specific.


Love it. I think Node is the hot shit right now, and Rails is starting to get into the hate-period.

It's really just hipsters and music. We are all really just hipsters.


>I think Node is the hot shit right now, and Rails is starting to get into the hate-period.

I think you're about two years behind. Node is starting its hate-period. Rails got it a year or two ago.


What's the hot shit right now then, if not Node?


Meteor, websocket-dbs.


Meteor runs on top of Node.js


Node runs on Linux, but Linux isn't the hot shit.


HTML5 is a panacea.


All developers need to understand the Gartner hype cycle.



It's just the standard Internet hype cycle.

I'm guessing within the next year we're going to see a similar backlash against Hadoop as people who rush to it begin to discover that map/reduce isn't necessarily the best distributed processing model for their needs. This despite Hadoop continuing to be the effective (non-silver) bullet it's always been.


Meh, Hadoop is much older than MongoDB, is an offline system, and doesn't destroy your data. There is also not much in the way of alternatives. I'm skeptical we will see many "we're switching off of Hadoop" posts anytime soon. That said, there is a growing undercurrent in the Clojure community to roll your own map/reduce system instead of using Hadoop.


There's quite a bit in the way of alternatives. Sector/Sphere, Xgrid, and PVM come to mind. Sector/Sphere is perhaps the most direct alternative since it comes with its own distributed filesystem, but if you don't need a full-fledged DFS then the others are worth looking at too.

The main thing that really distinguishes Hadoop is that it's built to only do one kind of distributed processing. In exchange for asking you to don that straitjacket it offers ease of use. However, if your problem isn't naturally a map/reduce problem, or if your main performance bottleneck isn't disk I/O, then the alternatives become a lot more attractive.


I wouldn't draw any conclusions based on the hive mind of social media story up-voting.


There are a lot of people still switching to MongoDB for various reasons, but it's no longer the cool kid on the block, and thus not a lot of people are going to brag about switching to it.

I think a lot of people switched to it because it was cool, and maybe assumed that it could solve any application data storage problem, and are now finding out that it may not have been a great choice for them.

I don't think it's appropriate to take away from this that MongoDB and/or other document stores are bad. Instead, I think it's important to understand how they work and decide how well they apply to your use case. They're not going to work well for all applications.


While some of the author's criticisms are valid, some of them are completely wrong:

> Having no option to perform an operation comparable to UPDATE table SET foo=bar WHERE....

What? db.collection.update does exactly this. See: http://www.mongodb.org/display/DOCS/Updating#Updating-update...
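For anyone who hasn't seen the equivalence, here is a minimal sketch of the two side by side. sqlite3 stands in for the SQL database, and the table name `t` and its fields are invented for illustration:

```python
import sqlite3

# The SQL side of the comparison: a bulk, criteria-based update.
conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer primary key, foo text, status text)")
conn.executemany("insert into t (foo, status) values (?, ?)",
                 [("old", "active"), ("old", "inactive"), ("old", "active")])
conn.execute("update t set foo = 'bar' where status = 'active'")
rows = conn.execute("select foo from t order by id").fetchall()
print(rows)  # [('bar',), ('old',), ('bar',)] -- only matching rows rewritten

# The MongoDB shell analogue (not executed here; it needs a running mongod):
#   db.t.update({status: "active"}, {$set: {foo: "bar"}}, false, true)
# The final `true` is the multi-update flag; without it, 2.x-era update()
# modifies only the first matching document, which may be what tripped
# the author up.
```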

MongoDB fits a nice niche as a read-heavy, mid-scalability db solution. Every DB has its niche. Trying to use it outside of what it's good for is going to get you burned. If people just did their research before blindly committing to a platform, we'd see far fewer posts like this.


Agreed. I found the mapreduce criticisms to be a little off:

> Now instead of fixing a bad implementation or fixing the underlaying architectural issues, MongoDB is moving to Hadoop.

I don't think that's accurate. They have a new "aggregation framework" coming that is meant to replace mapreduce. It could be a wrapper around hadoop, but I couldn't find anything documented about that. I completely agree that a blocking mapreduce is annoying, however, does any framework have a non-blocking mapreduce? I haven't tried many mapreduce implementations out, so this is a genuine question.


The aggregation framework is meant to fill the gap left by SQL's SUM, COUNT, AVG, etc. without requiring a full map/reduce. The Hadoop integrations are unrelated and are just a nice little bonus that they added.


I don't believe RavenDB blocks. Instead, it returns the results along with a flag to indicate if any of the source data may have been modified while the operation was running.


You're right, all RavenDB indexing operations (including Map/Reduce) are done on background threads. When you query, it returns a flag (as you say) indicating if the index is currently stale or not.


I was wondering if he meant copy fields from one column to another, or to modify a value in a document based on the value in another field. I don't think that's possible in Mongo without writing server-side JS or doing it in your client.

Agree with your general point, though - this seems like they have a product/tech mismatch. Though I'd argue that this isn't a niche - read heavy, mid-scalability is a lot of the web.


Well the other thing is: people act like migrations are the devil.

Yes, moving your data from Mongo to PgSql, MySql, MSSql, Oracle, etc. is going to be difficult.

However: why start with a production-level Oracle install anticipating "web scale" when you're not going to have more than ~1k users at launch?

People look to Google and Twitter and Facebook for advice on how to scale. They then apply this advice far before they need to.

I don't think you should plan to be large scale from the get-go. I think you should have a plan for what you're going to do if you get big.

My $.02


I really don’t understand how they used it for 2-3 years without knowing this.


> Leaving memory management to the operating is nice idea - in reality it does not scale and does not play very well.

This is why I think that Linus's tirade against O_DIRECT is misguided: https://lkml.org/lkml/2007/1/10/233

Here's the thing: the kernel is a library. It took me a long time to fully understand this deep idea. The kernel is just a library that has a different and more expensive calling convention (syscalls) and runs at a higher privilege level.

It's also much less flexible than user-space libraries. Its interface is an unholy mix of syscalls, ioctl(), /proc, vdso, etc. There is a high bar to adding new interfaces. Removing or changing existing interfaces is basically not allowed.

The resources that the kernel uses are much harder to account for or predict. How can you ensure that a process always gets at least X MB of page cache, and that some enormous "cp" that some sysadmin is running won't evict all your MongoDB pages that are caching your database? Sure you could mlock() your pages, but now you're basically side-stepping all of this smart kernel cache management that was supposed to be helping you so much in the first place.

User-space management of buffers and caches is more flexible, easier to account to its owner, and more predictable. It can't handle page faults with Linux's current interfaces, but the L4 guys have figured out how to let pagers run in user-space and handle page faults. I hope that someday this work becomes mainstream. Our 20-year-old OS design is showing its age.


This is why most DBMS products do their own memory management. They just suck up as much memory as you will allow and use it for their own devices. Specialized, tunable, application-level memory management will probably always beat a general-purpose, application-ignorant OS-based scheme.

But, there's something to be said for the simplicity - for most folks you don't need to manage the memory yourself. When you need it though, there's really no easy substitute.


> But, there's something to be said for the simplicity - for most folks you don't need to manage the memory yourself.

Yes, and most people don't need to implement printf() themselves, which is why there is libc. Just because you're doing it in-process, in user space, doesn't mean you're rolling your own!


So does Java, for instance.


> There is no single way to control the memory usage using system tools except maintaining mongod instances on dedicated virtual machines without running further services. There are numerous complaints from people about this stupid architectural decision from various side and 10gen is doing nothing to change this brain-dead memory model.

Can someone explain to me why this is actually a big issue? Except for really tiny apps, I imagine that having dedicated VMs for your MongoDB actually would be perfectly fine? Probably even preferred?


I was wondering the same. It's pretty standard for databases to not be very good about sharing with others, memory-wise. That's why having a dedicated server is such a popular best practice for non-puny applications. And MongoDB says it's not designed for puny applications right there in its name.


There are fairly straight-forward ways to control memory usage in MongoDB, especially if this is a concern for you. All you need is a bit of OS know-how and it works just fine.

We run both large (multi-shard clusters) and small memory (500MB - 2GB) use instances of MongoDB and have no problems.

It would be good to have developers acknowledge that, perhaps, they may not have all the information instead of declaring that something can't be done.


Care to explain how, say in linux?


Either jason has something really interesting up his sleeve or he simply doesn't know what he's talking about.

Last I checked linux had no interfaces to partition the pagecache in a meaningful way, short of rather extreme gymnastics involving kernel-patches or tmpfs abuse.

If that has changed then I'd certainly also be curious to hear about it.


Pretty much any sane architecture has machines dedicated to handling the data store (MySQL, Mongo or whatever). If you are scaling up from a single machine, this is the first split you make.


Hell yeah.

From what I've seen of MongoDB I'm not impressed at all. In some carefully controlled cases, performance would be acceptable, but change anything at all (even the order that data is inserted) and it just sucks.

For one particular application, the performance difference between MySQL and Mongo was like the difference between the Space Shuttle and a Chevy Sonic.


I think it's important people understand WHY this is the case, because it's not simply an issue for MySQL vs Mongo.

The simplest way to get really good performance for multi-row queries (even if you fall out of cache) is to physically order your data in query order. That is, if you are going to ask for the most recent 100 blog comments, order the comments by (blog-post-id, reverse comment-date).

MySQL-InnoDB makes this really easy (your data is physically ordered by primary key). In MongoDB it's not possible.

Here is a more elaborate explanation...

MySQL-InnoDB stores record data in primary-key order (it puts the data right into the b-tree). This means that if you want to access 100+ records in primary-key order, it's pretty darn efficient. Even if it's out of cache, it could be just one or two disk seeks (depending on how many records fit in a block)

MongoDB stores record data in a heap-table in a semi-random order based on insertion and freespace, with each document receiving a "doc id". You can make an index on whatever you want, but when Mongo fetches multiple records, it cross-references every index entry with the doc_id. If your data is out of cache, this means a seek for _every single document_. AFAIK, as of 2012, there is no way around this, because there is no way to get mongodb to store the document data directly in the b-tree. This is a big part of why Mongo is super-slow if you fall out of cache.

HOWEVER, there are some other systems that also have this problem, including some ORMs that sit on top of MySQL. Ruby-on-Rails forces you to use an auto_increment primary key for every record -- which means even if you use MySQL, you are forcing your data to be in a semi-random insertion order. Django does this as well. If you want to efficiently fetch a bunch of records (like 100 comments on a blog post), then you want them to be in primary key order.

In the SQL world, Oracle, MS-SQL, and Postgres also normally use a form of heap-table for records... This allows records to have a physical "ROWID" which can be used to directly look them up (it's a physical block address with an O(1) lookup). However, it also means they are in semi-random order. The ROWID was an important join and foreign-key optimization back in the days of nightly SQL jobs on machines with very little RAM. Today, it's not a good optimization. B-tree indirect blocks are always in RAM, so direct ROWID lookups have little benefit over b-trees, and the huge drawback of no natural primary key ordering. One can work around this problem with covering indexes, index-organized tables, and key clustering -- all of which have their own frustrating tradeoffs.

Does primary-key ordering have downsides? References to records are bigger (They have to contain the full primary key), there is no guaranteed stable way to reference a record (for foreign key constraints), and if you change fields in the primary key, the entire record must be moved. In the "old days" there was another big drawback, b-tree O(log-n) lookups are much slower than ROWID O(1) lookups.. Today this is not an issue because all b-tree indirect nodes fit in RAM always.

Bottom line, if you want the easiest way to order data properly, use MySQL-InnoDB, and choose your primary key wisely. If you are using another storage system, study up on how you can control physical ordering because every system is different.
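The "choose your primary key wisely" advice can be sketched with SQLite, whose WITHOUT ROWID tables store row data inside the primary-key b-tree much like InnoDB's clustered index. The table, column names, and negated-timestamp trick (to get newest-first in ascending key order) are all invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Rows live in the primary-key b-tree, so all of one post's comments are
# physically adjacent, newest first.
conn.execute("""
    create table comment (
        post_id     integer,
        neg_created integer,   -- negated timestamp: newest sorts first
        body        text,
        primary key (post_id, neg_created)
    ) without rowid
""")
# Insert in arbitrary (semi-random) arrival order, as a heap table would.
conn.executemany("insert into comment values (?, ?, ?)", [
    (1, -100, "first"),
    (2, -150, "other thread"),
    (1, -300, "third"),
    (1, -200, "second"),
])
# Fetching the newest two comments for post 1 is a single contiguous
# b-tree range scan in key order -- no per-row seek.
recent = conn.execute(
    "select body from comment where post_id = ? order by neg_created limit 2",
    (1,)).fetchall()
print(recent)  # [('third',), ('second',)]
```

Against a heap table, the same query would touch one page per matching row once you fall out of cache; here the two newest comments sit next to each other in a leaf page.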


Does MySQL(InnoDB) allow you to cluster on something other than the PK?

Big advantage in that option (Postgres, MSSQL, Oracle).


I like your analogy. I just imagine Rob Dyrdek doing a kickflip while the space shuttle does the same thing, like a whale breaching in the sky, haha.


But performance is just one of many aspects of a database. Relational databases are a nightmare to develop for when you have a fluid object model.


It's never been a problem for me

    create table whatever_attribute (
        whatever_id integer,
        name varchar(255),
        value text,
        primary key (whatever_id, name)
    )

if you never need a special index you can just stick new attributes onto "whatever" and never recompile
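A runnable sketch of that attribute-table pattern, using sqlite3 and a composite key so each entity can carry any number of attributes (all names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per (entity, attribute) pair: adding a new attribute needs no
# schema change, just another row.
conn.execute("""
    create table whatever_attribute (
        whatever_id integer,
        name        varchar(255),
        value       text,
        primary key (whatever_id, name)
    )
""")
conn.executemany(
    "insert into whatever_attribute values (?, ?, ?)",
    [(1, "color", "red"), (1, "size", "large"), (2, "color", "blue")])

def attrs(whatever_id):
    """Rehydrate one entity's attributes as a plain dict."""
    rows = conn.execute(
        "select name, value from whatever_attribute "
        "where whatever_id = ? order by name", (whatever_id,))
    return dict(rows)

print(attrs(1))  # {'color': 'red', 'size': 'large'}
```

The usual caveat applies: every value is text and every lookup is a join, which is exactly the tradeoff the replies below argue about.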


>if you never need a special index

You need to use indexing unless your app is a toy.


"My essage to companies building applications on top of MongoDB: assigned smart people to MongoDB and don't leave the database work to people that can hardly spell their name or that can just count to three. Yes, this paragraph is harsh and does not comply with diversity but it is true and reality. The number of people that should not do any database related work, people without reasonable background, people lacking basic skills in understanding databases is extraordinary high."

Isn't that true for any database? What point are you trying to make? That a large MySQL deployment can be flawlessly be maintained by people that can "hardly spell their name"?


I also like "essage" and "assigned smart people" instead of "assign smart people" in a sentence talking about intelligence and spelling. He's probably not a native speaker but that only excuses the second bit, and only partly.


Yes it turns out that to operate a DB, you do need DBAs.


This is kind of a strange list of complaints.

MongoDB memory management is a legitimate concern... but not because it's hard to control memory usage of a single mongod.

"More granular locking" is a temporary, non-scalable solution?

I've run out of energy, actually, but really?


Ya, there are lots of well reasoned complaints about MongoDB out there. This is...not one of them. It's weird that this is getting voted up.


This reads more like a rant than an actual discussion of the problems the company was having with MongoDB. There is a place for valid criticism, but this is the polar opposite. I am actually more interested in the fact that this made it to the front page so fast than in the actual content of the article. Are there so many people upset with Mongo that even something as poorly done as this rant can get publicity?


We dumped Mongo for a lot of the same reasons listed in the article. Their python driver isn't great either.


In favour of...?


In my case, Redis. Really good choice for my project


I don't understand how you would switch from mongodb to redis. They don't do the same thing, at all.


I think your second sentence answered your first.


How does someone mistakenly choose mongo for a job fit for redis?


Curious to hear more about your complaints with PyMongo. Care to share any specifics?


Wow, how many "Goodbye, MongoDB" stories are coming? The last few days have had several. Not sure if this is already a trend?


The trend is this: any loved technology that's been around a little while (Rails, Node, Mongo, etc) makes for dramatic farewells on hackernews. It's not unlike supermarket tabloids.


I disagree. The trend is this: any hype has an equal and opposite anti-hype. 10gen have spent (probably) millions of dollars and countless hours representing MongoDB and singing its praises in various media (conferences, user groups, Web, etc.). If you do so, you need to expect equally fierce reactions when people realise your claims are unsubstantiated (or, at the very least, not as universal and problem-free as you originally portrayed).


Very good point. I never thought about it that way.

I have been criticizing them for their default settings (no response to write requests, safety turned off by default) for a while.

But actually it is not as much for settings themselves, but for lack of a clear red flashing warning on their front page. As their default settings could (still can?) lead to silent data corruption. The worst possible thing to happen to a database. That is shady and shitty practice if you ask me.

They are trying to market their technology to other developers, I expect them to at least try to be honest and open about the characteristics of their product. If they treat other developers like they are customers for male enhancement pills in 4am commercials, then I think they should also expect some backlash from said developers when they choose Mongo and then hit all the hidden assumptions and un-delivered promises.

Sure it is in the fine print. But if it is important, why not put it in the big print. It is a lot less painful for everyone in the long term.


I wonder where all those millions come from? All venture capital, or is 10gen gaining paying customers?


As a former Aol employee, I know of a large support contract between them and 10gen. I think a lot of money comes from large companies who have teams that want to use MongoDB, and the large company buys a huge support contract for that team and any other possible team who wants to use it. Maybe that was just Aol, but I'm guessing that pattern is the same in other large tech companies too.


"Trough of Disillusionment"? http://en.wikipedia.org/wiki/Hype_cycle


"Object oriented programming is dead" "Stop subclassing right now" "Java Is A Ghetto"


IIRC We've had several "Goodbye CouchDB" articles in the past few weeks too. What's the under/over on the inevitable "Goodbye Riak" wave?


Perhaps soon, but one thing that has always struck me about Basho (and maybe why Riak isn't as well-known as the other NoSQL DBs) is that they don't have a huge hype machine. They've always been more interested in fixing the problems with their DB and cultivating a community that understands the benefits/detriments of their database than blasting out "HEY EVERY1 SHOULD USE RIAK!!!" everywhere. Their site is very specific about what Riak is good at and what it's bad at.

I've seen many mailing list responses to questions about Riak saying "maybe Riak isn't the best fit for this, try looking at X..." whereas 10gen wants everybody to use MongoDB for everything.

It just seems to be a more honest operation, and people don't get burned as much.


As a paying customer of Basho who has had several direct experiences with sales and technical support, I second this. They are incredibly helpful in the event of a problem and are passionate about educating their customers to provide the highest chance of success with their products. They are honest about the limitations imposed by Riak and will go out of their way to help identify a solution. I have literally never had as positive an experience with a technical support team as with Basho. It is very refreshing.


MongoDB really is the MySQL of the NoSQL movement (http://dieswaytoofast.blogspot.com/2012/02/mongodb-mysql-of-...), which really, really causes issues. AFAICT, the mind-set tends to be (1) There's this new thing called NoSQL (2) MongoDB is the biggest NoSQL store out there, let's use it (3) Profit! This omits a critical step, viz., "what are we trying to solve for here?". The result, of course, is either people getting burned, or a bit of a mess to unwind, neither of which is particularly fun (though it is "Profit!" for others!)

Riak's relative obscurity actually serves well in this regard. Odds are that if you discover Riak, you do so for a reason - you've been searching, you have a specific itch that needs to be scratched, etc. Given that you are already pretty clued when you get to it, you are extremely unlikely to pick it for the wrong reason, and consequently the success rate tends to be astonishingly high. Mind you, I'll add that the folks at Basho are some of the nicest people I know, and that helps a lot :-)

Bottom line - I really don't expect Riak to go away anytime soon...


People seem to hate successful projects. Mongo is probably the most successful NoSQL db out there (not going to war on its merits and flaws) and now people hate it. I wonder how long it will take for us to start reading "Why I hate Riak" stories.


But.. but.. MongoDB is web-scale: http://www.youtube.com/watch?v=b2F-DItXtZs

(Note how the video raises some of the same concerns as the blog post)


Because nothing adds to the cogency of one's argument like cartoon animals...


I'm amazed at all these (excuse me) idiotic articles. People/projects have different requirements, so there are many databases around(relational and nosql and key/value). Just because your needs do not match MongoDB's (or MySQL's or ...), does not mean the technology is useless.


The biggest problem with MongoDB IMO is that BSON dictionaries are ordered. Let that sink in for a sec: the hash data structure must be ordered. The solution most drivers run with is to just alphabetically order each dictionary, an inefficiency I'm not really happy with.
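The ordering problem is easy to demonstrate without BSON itself. In this sketch JSON stands in for BSON, since both serialize a document as an ordered list of key/value pairs (this is an illustration, not actual driver code):

```python
import json

# Two logically identical mappings whose keys arrive in different orders.
# Python 3.7+ dicts preserve insertion order, so serializing each produces
# different bytes -- the same mismatch BSON has between "hash map" semantics
# in your program and "ordered document" semantics on the wire.
a = {"lat": 1.0, "lng": 2.0}
b = {"lng": 2.0, "lat": 1.0}

assert a == b                          # equal as hash maps
assert json.dumps(a) != json.dumps(b)  # unequal as ordered documents

# The driver-side workaround described above: normalize by sorting keys
# before serializing, so field order no longer matters (at some CPU cost).
assert json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)
```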


Look, someone else realizing that Codd was right.


Right about what, exactly? The relational model certainly has its advantages, but it is not without its disadvantages, either. When it comes to databases, there's no one model that caters for every situation.


This is completely unrelated to the subject, and I have not used mongodb, have no idea if it's good or bad, and I don't even know how to spell it, but isn't it ironic that this post shows up on the day when the top HN post's title is "Please learn to write"?

As I said, no idea how good or bad mongo is, but I'm guessing, if you are as sloppy in your code as you are in your English, I'll be happy to give mongo the benefit of the doubt...


It's unfortunate that right now none of the 3 major document stores seem to be doing all that well or are easy to use straight out of the box. I use and like MongoDB, but only for prototyping. I haven't decided what to go with longer term if my projects have a need. CouchDB is interesting but seems to be going through some serious growing pains right now, with the Couchbase product being very confusing to figure out and use. Riak is also interesting, but it seems more of a specialty tool than a general-purpose one.

Kind of a bummer.


I've done a lot of research and testing with Riak (not had the pleasure of using it in production yet), and although some of the things that are easy in other DBs are backwards (like listing all records, for instance), I have a secret love affair with it. MongoDB claims to have simple scalability, but I've worked with it on a large project at Aol and found this to not be true at all. We basically had to implement our own sharding on top of it since its auto-sharding was so poor, and only needed to shard in the first place because of the global write lock. Riak, on the other hand, is an operations dream, and if you're a dev and ops guy in one (like me) I stay up at night thinking about using it for every project.

That said, a lot of people have started to use Riak for a general purpose tool. There are some development growing pains associated with this (and you have to think very carefully about your keys and data structure) but it's only getting easier with things like secondary indexes and Riak search. If there was a non-expensive way to enumerate all the data in a bucket, I think that'd be the one last item on the checklist before I jumped on it.


I keep reading about how mongo's use of memory mapped files is real bad. Isn't that the same technique used by varnish cache and that's what makes it awesome? Can someone explain please?


Varnish has a much simpler access pattern (which happens to fit the page-cache semantics like a glove in almost all use-cases) and was developed by a drastically more competent team.

Comparing Varnish to MongoDB is akin to comparing a precision Swiss Rolex to a plastic Mickey Mouse watch from a gumball machine.


Actually, using mmap-ed files is a great idea. It's precisely what Varnish does too.


Yes, using mmap is a good idea and not because it is easy. The problem is that mmap-ing is simple, so a lot of first-time database engine developers use it, but a naive usage of mmap is a recipe for poor operational behavior. A competent database engine that fully leverages mmap is actually pretty complicated internally. Things like managing write back scheduling behavior are important.

mmap() is a good way of doing things but for different reasons than some people assume. Like all tools, you have to learn how to use it well.
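For the curious, the basic mechanics are easy to play with from Python's mmap module. The parent's point is that a serious engine also has to decide *when* dirty pages get written back (the explicit flush below, i.e. msync), rather than leaving scheduling entirely to the kernel. A minimal sketch:

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "mmap_demo.dat")
with open(path, "wb") as f:
    f.truncate(4096)              # preallocate one page of zeroes

with open(path, "r+b") as f:
    m = mmap.mmap(f.fileno(), 4096)
    m[0:5] = b"hello"             # write through the mapping; no write() call
    m.flush()                     # explicit write-back (msync) -- scheduling
                                  # this well is where the engineering effort goes
    m.close()

with open(path, "rb") as f:
    print(f.read(5))              # b'hello'
```

Naive usage just writes through the mapping and hopes the kernel flushes at a sensible time; that hope is the "poor operational behavior" being described.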


The Redis project abandoned its VM implementation to focus on in-memory operation, while suggesting that datasets > RAM should be sent to an RDBMS. This has produced a few success stories built on the Redis+RDBMS combo.

The old-school DBA in me immediately thinks that the "70's tech" of the RDBMS is still widely used because the time to develop sufficient memory management is hard work and takes time.


That was exactly what my thoughts were too: letting the OS do all the memory management and caching is a strategy many great projects use, PostgreSQL and Varnish among them.

However, I do feel there is something "wrong" about the approach MongoDB is taking. It needs to preallocate huge new files, which saturate all I/O while being filled with zeroes. There is no logical hierarchy in the files, and it just feels a bit weird.

Perhaps they should've taken the approach PostgreSQL did, which is to simply use files and read from them instead of using mmap. The whole reason they went for a global lock instead of a more granular lock is that the whole mmap'ed area is one big blob, and it was the most "obvious" approach.


Thanks for the insight on the global write lock. I've searched and searched and wasn't able to find anything on why they have the global lock.

Out of curiosity, is there a simple way to explain why someone would mmap instead of just reading files directly (I've never done any programming with mmap, so I'm a bit ignorant of its use cases)?


This SO entry seems to answer your question fairly well: http://stackoverflow.com/questions/258091/when-should-i-use-...


mmap also blows up the TLB by taking up so many page addresses.


It is a widespread myth that PostgreSQL lets the OS do all the memory management and caching. I don't understand why it is so prevalent, though, considering it is so trivial to look and see that it is nonsense. PostgreSQL reads all data into a shared buffer cache. The data is stored in files, so of course the filesystem buffer cache is also used, but the idea that PostgreSQL leaves it entirely up to the OS is totally false. It has its own cache on top of the filesystem cache.


So what is the new solution? Back to relational? Or another type of document store?


MongoDB is fine at the conceptual level. The problem is that the architecture and implementation are consistently poor in myriad ways. It is a case study of what can happen when well-meaning individuals with little experience in database architecture and implementation attempt to build a scalable database engine.

That said, a competently engineered RDBMS can do everything NoSQL databases can, particularly limited databases like MongoDB. The caveat is that you have to learn how to use those databases; they are very feature rich and powerful but that flexibility makes them more complicated. PostgreSQL is a very good choice from the open source world and is just as fast as NoSQL du jour in the hands of someone that knows it.

I currently design extreme-scale real-time analytical database engines, so I have no vested interest in any particular solution (we are not really competing with the current market). If I was going to build a large-scale web app today and needed a backing database, I would go with PostgreSQL -- it is very capable and well-engineered.


...a competently engineered Turing Machine can do everything NoSQL databases can, particularly limited databases like MongoDB. ... Emacs is just as fast as NoSQL du jour in the hands of someone that knows it.

FTFY.

The difference between using a good competently engineered distributed database and PostgreSQL is that when your prime concerns are horizontal scaling and operations costs, the distributed database can be an order of magnitude simpler and faster than PostgreSQL given the same amount of effort.


Both. I am waiting for a JSON data type in a relational database where you can index, filter, order, etc. on keys.
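You can get a taste of that already: PostgreSQL has hstore for key/value columns, and SQLite's json1 extension lets you filter and order on keys inside a JSON column. A sketch via Python's sqlite3 module, assuming your SQLite build ships with json1:

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
con.execute("INSERT INTO docs (body) VALUES (?)",
            (json.dumps({"name": "ada", "score": 42}),))
con.execute("INSERT INTO docs (body) VALUES (?)",
            (json.dumps({"name": "bob", "score": 7}),))

# Filter and order on keys inside the JSON document.
rows = con.execute(
    "SELECT json_extract(body, '$.name') FROM docs "
    "WHERE json_extract(body, '$.score') > 10 "
    "ORDER BY json_extract(body, '$.name')"
).fetchall()
print(rows)  # [('ada',)]
```

Indexing on an extracted key (an expression index) is the remaining piece that makes this practical at scale.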


Blame yourself before blaming MongoDB. If you've been around the software industry, you should always be mindful of fads and vaporware. When you make a decision to use MongoDB, you'd better have done your homework first or done some testing yourself. Given the low cost of renting a bunch of EC2 machines for a few hours, it's idiotic to build a business around MongoDB or any other system that hasn't been fully proven without doing a bunch of stress testing yourself. And don't trust software vendors; get independent advice or test it yourself.


A good read. I'm working on a relatively large project now, written in Node (whee), and I was considering going full-koolaid with Mongo. I think I may stick to MySQL.


Give PostgreSQL a shot if you're still considering databases.


I love Postgres, but I'm building something for OSS use and most unzip-and-deploy devs don't know how to set up Postgres.

Any additional information that you might be able to give me that might convince me?


I'd look into Postgres more first. It has some features that give you some of the convenience of a document DB. I'm not an expert, but from what I've heard there's stuff like JSON/XML support and custom column types that can store arrays and other structured data instead of just scalars.

It sounds pretty powerful and I've been told it is rock solid. That being said, don't buy into the anti-mongo hype so easily either. It's been overstated and for a large portion of the middle-ground on scalability and performance mongo is great.


It's not difficult to set up: it's included in most package managers, EnterpriseDB produces installers for desktops / workstations, and building from source is straightforward.

The perceived "ease" with which MySQL is available, and thus a lack of actual understanding about how to work with an RDBMS, is why MySQL is so widespread yet such a maligned platform.


Ah. If your target is people who aren't going to be reading docs, I don't think any DB is really preferable. :)

Any more details about the project, or is it under wraps?


It's basically a simple blogging platform, going to be released OSS.

Most devs know how to get MySQL running without too much thought, but many don't even know what Postgres is.


It might be worthwhile educating them. It shouldn't be that much harder to show them how to setup Postgres than MySQL :-)

If it's only a simple blogging platform, it might well be that MySQL will do the job just fine though.


Well, with Postgres you can include an installation in your distribution; with MySQL you cannot.


It's not ready for primetime for apps that require a great deal of relational mapping. Best cases are apps with very compartmentalized documents, a more robust caching layer, or something that doesn't require a lot of cross-model mapping.

We found out the hard way with our app, with which we wanted to do a lot of associating and joining of information. You either get locked into the document, or locked into writing interpretive code, and neither case is fun to work around.


My Xen VM crashed several times a day for several weeks. The problem disappeared when I removed my MongoDB database...


Some kernel versions had issues with block devices causing kernel panics; a minimum of 2.6.36 is needed.


> using JSON as a query language was a bad decision. The current JSON query language works for standard queries but the functionality of the operators is limited.

These two things don't go hand in hand. JSON could be used to elegantly represent complex queries. A problem with the query system isn't necessarily a problem with JSON.
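To illustrate, a JSON-style query document can express a fairly complex predicate. Here's a toy Python evaluator for a small subset of MongoDB's real operators ($or, $gt, $in); the `matches` function is made up for illustration, not driver code:

```python
def matches(doc, query):
    """Tiny evaluator for a subset of MongoDB-style query documents:
    supports $or, $gt, $in, and plain equality."""
    for key, cond in query.items():
        if key == "$or":
            # At least one sub-query must match.
            if not any(matches(doc, q) for q in cond):
                return False
        elif isinstance(cond, dict):
            val = doc.get(key)
            for op, arg in cond.items():
                if op == "$gt" and not (val is not None and val > arg):
                    return False
                elif op == "$in" and val not in arg:
                    return False
        elif doc.get(key) != cond:
            return False
    return True

query = {
    "$or": [
        {"age": {"$gt": 21}},
        {"status": {"$in": ["admin", "editor"]}},
    ],
    "deleted": False,
}
print(matches({"age": 30, "deleted": False}, query))                      # True
print(matches({"age": 18, "status": "guest", "deleted": False}, query))   # False
```

Whether nesting like this is "elegant" or just XML with curly braces is exactly the disagreement in this subthread.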


>JSON could be used to elegantly represent complex queries

I think it could be used to represent complex queries; I just don't think "elegantly" would describe it. I'm thinking it would be a small step up from an XML representation.


We started out with just MySQL. Then added MongoDB + replica sets. Then added Cassandra. And now we just finished adding ElasticSearch. All of this for the same web application. Use the right tool for the job. The pattern I've noticed is that indeed we started migrating data out of MongoDB, mostly to Cassandra.


is there any replacement for mongodb?



tl;dr We LOVED MongoDB (http://www.zopyx.de/blog/plone-using-highcharts-and-jqgrid) but we got burnt so it's USELESS and BRAINDEAD!


That's the common pattern. MongoDB is easy and convenient when you first start using it, but you hit the limitations pretty quickly. I had the same experience.


I really think if they had document-level locking, it would be a much more successful DB. 10gen's answer to this is "just shard!!1" except that sharding in Mongo really isn't as easy as they make it out to be... not to mention, what database makes you shard just so you can support more than a few hundred writes per second? Maybe this has improved a bunch since I got burned by it on v1.6, but I believe it's very misleading to claim scalability of a product that has a global write lock. Being forced to shard when you reach small-medium size is just sad.

All other aspects of it I love, though. It really makes development and deployment so much faster. Replica sets aren't perfect, but setting up MySQL replication w/ automatic failover on three or more machines is a recipe for disaster unless you have a DBA to sit there and babysit it full-time.


Locking in particular has improved substantially since 1.6. For example, 2.0 introduced yielding in some cases where MongoDB would go to disk rather than page-faulting with the lock held[1]. This has been extended and improved for 2.2, along with increasing the granularity of the lock from process-wide to per-db. There are plans to increase the granularity further in future releases.

[1] To see an example of the difference this makes see http://blog.pythonisito.com/2011/12/mongodbs-write-lock.html


mmap files and sharding...

It seems like the problem is that you're not using MongoDB in a sharded setup to begin with. For good or bad, MongoDB targets the scale where you need sharded and replicated setups. In other words, a large enough operation to require multiple servers for data storage. If you need the opposite of that, which is multitenancy, MongoDB is not going to be a good fit.

On the other hand, MongoDB has always been sold as a rapid prototyping and easy to iterate datastore, which is attractive for people working on small projects. Then they have an "oh shit" moment when they run into operational issues.



