Jsonb: Stories about performance (erthalion.info)
132 points by erthalion on Dec 23, 2017 | 55 comments



That post fails to mention something critical: Postgres doesn't scale horizontally and isn't HA out of the box, so you're comparing a single-server solution against one that has clustering built in... What do you do when you need HA, or your data is too big, or you need a lot of writes/sec, etc.?

The hate for MongoDB is strong on Hacker News / Reddit, but I suspect most of you have never had to design a solution where master/slave replication is not enough.

Also: https://www.mongodb.com/mongodb-3.4-passes-jepsen-test


I'm curious whether the parent poster knows about the most recent HA features in Postgres, such as those introduced in v10, and if so, what is still missing for comparable HA. Here's a link to the HA docs for out-of-the-box postgres: https://www.postgresql.org/docs/10/static/high-availability....

As for scaling, it certainly scales out of the box with partitioning, but admittedly a partitioned table doesn't offer all the features that a non-partitioned sql table offers. A great assessment of the current state of postgres scaling can be found here: https://blog.timescale.com/scaling-partitioning-data-postgre...


While JSON fields in traditional SQL are useful in many ways (avoiding that infamous "extra" column), they really don't compare to MongoDB. The aggregation framework in Mongo for working with these documents is very powerful, and if you use pg JSON fields to do anything other than retrieve extra fields (i.e., if you try to query/aggregate their contents meaningfully as you would a standard field) you'll find they fall over pretty fast with even modest numbers of records.

The simple scalability of Mongo (sharding/replication) is something that, while the recipient of much mocking initially, now, at 3.6, gives enterprise-level availability and scale with minimal oversight or expertise.

I'm really liking the features in 3.6 (over the wire compression, replication hardening, streaming query subscriptions to name a few) and every release packs a ton of features and improvements without breaking anything.

Postgres is a great relational database (one of my favorites), but its JSON field (which even MSSQL sports nowadays) is not a stand-in replacement for Mongo in any way.


Just a suggestion that may be worth considering. Sometimes hate is confused with disagreement, and sometimes assuming people you’ve never met have less wisdom can cause hate before you’ve had a fair chance to make your point.


This is a great post, but it's sad that they addressed everything thoroughly except the most important question for me - query performance. I don't do all this mass updating of document-oriented entities. They are stored and they stay stored. But we query the hell out of them, so I would love to know how well that actually performs with various indexes on various data types within the document.


You should take a look at FlatBuffers. It's an evolution of protobuf that's super optimized for querying subsets of data. Basically lookups of offsets and mmap all the way.


Surprising that PG beat Mongo at its own game. We've been recently using Jsonb with PG as a nice-to-have but always assumed Mongo would be much faster with schemaless data.


Well, PG is written by people who know what they're doing.


LibBSON somewhat humorously has no error handling for OOM and just kills the whole process if a malloc fails. Bit of an annoyance when porting to embedded hardware.


Come on can we keep this kind of stuff off HN?

If you have concrete criticisms say them so we can have a discussion, don't write snarky lines like this is some kind of "us vs them" game.


In an earlier comment yesterday I remarked that it was good to avoid platform wars, but the MongoDB crew squandered the right to converse as peers with all the lies they told. Remember when they compared their speed writing to memory with Postgres's speed writing to disk, as if this made them faster? Either they have fundamentally misunderstood what a database is, or they have deliberately acted in bad faith - which is it?


> how to apply document-oriented approach in the world of relational databases

I've increasingly found that situations where this is actually a good idea are quite, quite rare.


I'll play devil's advocate.

In any real-world system I've worked on, there are always entities that are purely self-contained but are represented by many different database tables. When they're saved, the entity is decomposed into its constituent parts and written to those tables. When they're retrieved, those tables are joined together to re-create the entity we care about. And that's all that ever happens: it's just serialised and de-serialised.

Isn't it premature optimisation to normalise that entity right out of the gate? Why not stuff it in a JSONB column? When - and only when - you find yourself needing to dig inside of it for queries in other parts of the system, then you normalise it.


Yup, and that's exactly the game I used to play.

But when you have more than a couple of developers on a project and have ... mixed ... levels of experience, things easily get out of hand while your back is turned. Before you know it you've got a dataset with fields that are... well, you don't know what they are - they could be strings, they could be true, false, entire sub-objects or arrays - they could even be json-null. And then you have to potentially cope with all those possibilities in every place you use that field.

Yes, of course you can attempt to put some sort of constraint on the data format (jsonschema perhaps?), but that's not exactly easy to enforce as an in-database constraint, which is the most useful and powerful form of guarantee you can have.

Of course, it also leads less experienced developers away from using postgres' rich types (even datetimes are too rich for json/jsonb!) and leaves it impractical (or impossible) to perform logic on data in-query (which is often the most efficient and robust way of doing it). This road also leads to people trying to make references to other objects within json blobs. Try ensuring referential integrity on that!

It all comes down to: you have a schema, whether you think you do or not, and if you let the database know about that schema it can help you manage it.


One word: constraints. The problem with actually storing things in a document-oriented manner is that it requires extreme vigilance and no mistakes to actually keep the shape of your documents consistent. Further, it can be extremely costly to discover all the shapes of data you actually have if you initially didn't treat it as important. More often than not, it is useful to use tools that guide you into being disciplined along the way.


I use C# with Mongo. When I get a Collection<T>, the only thing I can put in the collection is the strongly typed T, and my Linq queries are strongly typed. The compiler enforces consistency.
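
Roughly, it looks something like this - a minimal sketch assuming the official MongoDB .NET driver; the Customer class and connection string are just made up for illustration:

    using MongoDB.Driver;

    var client = new MongoClient("mongodb://localhost:27017");
    var customers = client.GetDatabase("shop").GetCollection<Customer>("customers");

    // Only Customer instances can be inserted, and the predicate below is
    // type-checked against Customer at compile time.
    customers.InsertOne(new Customer { Id = "acme", Name = "Acme", CreditLimit = 5000m });
    var bigAccounts = customers.Find(c => c.CreditLimit > 1000m).ToList();

    public class Customer
    {
        public string Id { get; set; }
        public string Name { get; set; }
        public decimal CreditLimit { get; set; }
    }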


Right, but can you ensure that the only thing that ever connects to your datastore (even to perform auxiliary/admin actions or quick data tweaks) is using that code in strongly typed mode? Otherwise you can't be sure what's in your data.


I can't be sure that a developer doesn't go behind the scenes and change SQL Server either.

But programmatically, I do ensure only one microservice (out of process) or module (in process) is writing to a collection. When I'm using an RDBMS and developing a system, I also ensure that only one module is writing to the subset of tables that make up an aggregate root. I would never have a system where a bunch of programs are writing to the same tables willy-nilly without going through a common interface.


> I can't be sure that a developer doesn't go behind the scenes and change SQL Server either.

You basically can. The difference is that it's quite obviously notable to a developer of any level that changing the schema in the database is something that should be done with some forethought and care. And then when that change is made, all applications accessing the database get the same new view of the schema, so you're not going to have two different clients with different ideas about the schema trying to operate on the database at once, because the schema is global.

Making a change that suuuuubtly alters the way data is stored in an unstructured object is the kind of thing that's really easy to overlook in a pull request.

Your "only one module" rule is laudable, but you've also got to make sure that module can do everything you'd ever want to do to the database, including all the "one off" data mangling admin tasks anyone occasionally has to do. Otherwise it will get bypassed. Case in point I used to use an ORM heavily, but it wasn't possible to express everything in that ORM, so where that broke down we had to resort to manual sql. Add a few developers to the equation and you don't know exactly the format of your data.

Good schema management practice also means you end up defining all of your schema-mutating operations as migrations, leaving you with a rather vital log of how things have changed over time and a good pinch-point to catch inadvisable changes to schema.


> You basically can. The difference is that it's quite obviously notable to a developer of any level that changing the schema in the database is something that should be done with some forethought and care.

And if they change the schema in Mongo, with the official C# driver, by default, if any program tries to read the Collection<T> and the document in the database doesn't match the C# class, you'll get a loud and obvious error at runtime.

Of course, if you want the flexibility of letting objects change in the data store without affecting downstream consumers, you can decorate your class with the [BsonIgnoreExtraElements] attribute.
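
As a rough sketch (the Employee class and its fields are just illustrative):

    using System;
    using MongoDB.Bson.Serialization.Attributes;

    // Without this attribute, deserializing a stored document that carries fields
    // the class doesn't declare fails loudly; with it, unknown fields are ignored.
    [BsonIgnoreExtraElements]
    public class Employee
    {
        public string Id { get; set; }
        public string Name { get; set; }
        public DateTime HireDate { get; set; }
    }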

> And then when that change is made, all applications accessing the database get the same new view of the schema, so you're not going to have two different clients with different ideas about the schema trying to operate on the database at once, because the schema is global.

My contention is that you should not have two "clients" operate on the same schema. All clients should be using one module. I'm not advocating that microservices are the one true way; I'm saying that all clients should be using the same module, microservice, or stored procedures (well, I hate stored procs for that type of stuff, but that's another rant).

> Making a change that suuuuubtly alters the way data is stored in an unstructured object is the kind of thing that's really easy to overlook in a pull request.

There is nothing "unstructured" about a data class (POCO/POJO) in a strongly typed language. Yes, you can operate on a Mongo collection in .Net by getting a Collection<BsonDocument>, but most of the time you're going to get something like a Collection<Customer>, and the compiler is not going to allow you to insert, search, etc. anything but a C#-defined Customer with C#-defined data types.

Your "only one module" rule is laudable, but you've also got to make sure that module can do everything you'd ever want to do to the database,

That's the beauty of .Net, Linq, and expression trees. You can define your repository with a method of something like:

Find(Expression<Func<Employees,bool>> query) {...}

and the client can pass in any arbitrarily complex, strongly typed Linq expression; your repository can pass that expression to the Mongo driver and it will translate it into a Mongo query. It gives you even more flexibility because if, later on, you decide to use an RDBMS, those same client queries can be translated to SQL by Entity Framework.
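
A minimal sketch of that shape (the interface and class names here are made up, and the driver calls assume the official MongoDB .NET driver):

    using System;
    using System.Collections.Generic;
    using System.Linq.Expressions;
    using MongoDB.Driver;

    public interface IRepository<T>
    {
        IEnumerable<T> Find(Expression<Func<T, bool>> query);
    }

    public class MongoRepository<T> : IRepository<T>
    {
        private readonly IMongoCollection<T> _collection;

        public MongoRepository(IMongoCollection<T> collection) => _collection = collection;

        // The expression tree is handed straight to the driver, which translates
        // it into a native Mongo query; an EF-backed implementation of the same
        // interface would translate the same caller expressions to SQL.
        public IEnumerable<T> Find(Expression<Func<T, bool>> query) =>
            _collection.Find(query).ToList();
    }

    // Callers pass arbitrarily complex, strongly typed predicates, e.g.:
    //   var recentHires = repo.Find(e => e.HireDate > DateTime.UtcNow.AddDays(-30));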

But even if you don't want to go as far as "one true module" for your data, there is no reason not to have one module that defines your data types. Anyone querying against or inserting into a Collection<T> will always have the right document definition. Of course, you can put all of the definitions into a NuGet package, and it will still be obvious when changes were made because your NuGet package will be versioned.

> including all the "one off" data mangling admin tasks anyone occasionally has to do.

A one-off admin task wouldn't be part of your production flow, and you would probably bypass any abstraction that you had - including stored procedures, views, etc.

> Otherwise it will get bypassed. Case in point: I used to use an ORM heavily, but it wasn't possible to express everything in that ORM, so where that broke down we had to resort to manual SQL. Add a few developers to the equation and you don't know exactly the format of your data.

There should always be one team responsible for your business objects (in DDD your aggregate roots). Why would you let any one developer change your underlying data store without a code review and/or pull request?

> Good schema management practice also means you end up defining all of your schema-mutating operations as migrations, leaving you with a rather vital log of how things have changed over time and a good pinch-point to catch inadvisable changes to schema.

If I'm using a strongly typed language like C# and defining my documents based on a class, my Employee.cs file (and all of its child classes) is in sync with my Mongo document store. The language ensures that at compile time. "Knowing my schema changes" is simply a matter of doing a "git blame" on the Employee.cs file.

Even if I'm not abstracting all of the data retrieval in a repository, I don't see any reason not to at least have a NuGet package of all of my POCOs that define the documents in my collections.


Yes, the issue you run into is that then your server has to make sure your json is the right shape.

CouchDB actually solves that problem really well IMO. You upload a validation function in javascript to the database that will validate every document that comes in. I wonder if the same thing can't be done with trigger functions in postgres.


You don't even need to use triggers. You can just write a plain old CHECK constraint on the column storing the document and use a SQL expression with the built-in JSON/JSONB operators or a stored function to interrogate the document however you'd like, evaluating to true to accept the data or false to reject it.
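
For instance, something along these lines - a sketch against a hypothetical events table with a jsonb doc column; it's applied via Npgsql here only to keep this thread's examples in C#, and the CHECK expression itself is the interesting part:

    using Npgsql;

    const string ddl = @"
        ALTER TABLE events
        ADD CONSTRAINT events_doc_shape CHECK (
            jsonb_exists(doc, 'id')                      -- key must be present
            AND jsonb_typeof(doc->'id') = 'string'       -- and hold a string
            AND jsonb_typeof(doc->'payload') = 'object'  -- payload must be an object
        )";

    using var conn = new NpgsqlConnection("Host=localhost;Database=app;Username=app");
    conn.Open();
    using var cmd = new NpgsqlCommand(ddl, conn);
    // Inserts/updates whose document doesn't satisfy the expression are rejected.
    cmd.ExecuteNonQuery();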


You can, but complex datatypes (JSON) have complex (and often ugly) constraint definitions. It's not much fun compared to "look, I just want this field to be a date."


>MongoDB has some problems with spinlocks,...Interesting enough is that this situation is getting better between releases (but it’s still quite significant), for example here is comparison of throughput for 3.2 and 3.4:

Is there an image missing or something?


Hm...no, it's right there, below this phrase https://imgur.com/a/vN0Q3


A bummer that VMs were used. I have used packet.net to provision bare-metal servers for the purposes of benchmarking. The server specs are close to the same servers TechEmpower uses as well.


There is absolutely no good reason for anyone to use MongoDB, ever; it's just such an amazingly rotten database that it's astounding how much hype it generated.


Maybe so, but please don't post unsubstantive comments to HN, or rant here in flamebaity ways.

We're looking for thoughtful comments. That doesn't require changing your views at all, just presenting them with more information, and an orientation to good conversation rather than venting.


*Today. The thing you're forgetting is that Mongo was actually somewhat novel at the time it was released. I've heard at least two stories of engineering teams who felt that their specific product couldn't have scaled the way it did and they wouldn't be here today without Mongo, for whatever reason. I've heard the words "it was the best engineering decision we ever made" from one team whose company ended up selling for 8 figures.

You're being hyperbolic. Yes, there are so many better choices for databases out there. There are also many worse choices. Mongo is very hard to recommend nowadays, but unless you go in completely ignoring its problems, it's not like your company is going to go up in flames.


> I've heard at least two stories of engineering teams who felt that their specific product couldn't have scaled the way it did and they wouldn't be here today without Mongo, for whatever reason. I've heard the words "it was the best engineering decision we ever made" from one team whose company ended up selling for 8 figures.

Do you expect to hear them say, "Yes, choosing MongoDB was a bad decision and our production infrastructure is currently on an unwieldy, slow, dangerous piece of crap" and then go on to sell the company for tens of millions of dollars?

Do you expect people to admit the real reason they're using MongoDB, which, in 99% of cases, is "we don't know how to use SQL and are really perturbed that anyone thinks we should have to learn"?

It is true that MongoDB only spontaneously combusts occasionally, but that doesn't mean the choice is of negligible importance.


These were private conversations, professional to professional, not conversations in the context of their sale. Maybe there is confirmation bias, but they could have just as easily said "Mongo was a piece of crap, we replaced it, it took a lot of engineering hours."

All of the complaints I've seen against Mongo are from people who read Aphyr's blog from their SQL High Horse and say "ha, look at those dirty Mongo peasants, can't you see that one in every ten million reads is inconsistent under high load? Why would anyone use this crap?" Use it in production, even under significant load, and you realize that sure, you might hit a snag every once in a while, but it's tolerable. It works fine. It's not trash. It's adequate. Understand the problem you're trying to solve. Understand Mongo's limitations.


> All of the complaints I've seen against Mongo are from people who read Aphyr's blog from their SQL High Horse

Nope, my dislike of MongoDB the technology and distrust of 10gen/Mongo the company is 100% based on real-world experience.

By the way, one-in-ten-million means several times a day. And it was a LOT more frequent than that...


> These were private conversations, professional to professional, not conversations in the context of their sale. Maybe there is confirmation bias, but they could have just as easily said "Mongo was a piece of crap, we replaced it, it took a lot of engineering hours."

First, word gets around, "professional to professional" or not. Second, yes, people will subconsciously shift the blame to whatever they feel the least responsibility and/or enthusiasm for (aka "confirmation bias"). It's hard to find people who are sufficiently secure to even admit such errors to themselves, much less to their colleagues, whom they want to think highly of them. Large doses of confirmation bias and other significant misconceptions that lead people to genuinely believe that their "shit doesn't stink", as it were, would definitely help whilst closing a multi-million dollar sale.

I've found that most of the technical choices that people make are simple fashion statements unless the technical limitations strictly disallow it (i.e., there is only one solution capable of providing a workable outcome). Software "engineers" and "architects" lie to themselves to justify whatever makes them feel more special and important. You don't have to look very far to see this in action (see: anyone claiming MongoDB was a good decision outside of a very unique or limited use case).

>All of the complaints I've seen against Mongo are from people who read Aphyr's blog from their SQL High Horse and say "ha, look at those dirty Mongo peasants, can't you see that one in every ten million reads are inconsistent under high load? Why would anyone use this crap?"

Please be assured that I read "Aphyr's blog" only when it hits an aggregator or when it comes up in a Google search. I haven't read his post on Mongo, and edge cases where it may lose a write are of minimal relevance to my dislike for it (though this isn't something to trivialize in a database).

> Use it in production, even under significant load, and you realize that sure, you might hit a snag every once in a while, but it's tolerable. It works fine. It's not trash. It's adequate. Understand the problem you're trying to solve. Understand Mongo's limitations.

Yeah, I mean, I'm leaving a caveat for the 1% of cases where people do understand their workload and Mongo was the stronger fit for whatever weird edge case is in play.

But in all the cases I've seen Mongo deployed, it absolutely has been a very straightforward matter of "SQL is old and crusty, JSON is new and neato and it doesn't make us think about the structure of our data before we start spraying random values all over the place so let's use that! Woohoo!"

If we saw more MongoDB deployments that were strictly replacing EAV tables, the argument that people generally are using MongoDB for some legitimate technical function would be a lot stronger. But per usual, people are, first, cargo-culting, and second, deploying something that trades the upfront effort of developing a schema for the not-so-distant development disaster that surely awaits those who refuse to plan their schemas.

You're right that Mongo doesn't just spontaneously combust, and I personally haven't contested that (I'm not the guy you originally replied to). I administer multiple Mongo clusters at $DAY_JOB and if left alone, they don't cause much fuss. But that is not the same thing as Mongo being a great engineering decision.


Hundreds of millions would be 9 figures.


Ouch, counting is hard. Edited. Thanks!


MongoDB doesn't scale. Writes are handled by a single node.


A single node per shard handles writes. If you have 20 shards, that's 20x more writes than a single node...


No, it's web scale.


It's not webscale if there is only one node doing all the work.


I used it once when it first came out just to kick the tires.

What a remarkably convenient API it had. NoSQL/"Document store" stuff was relatively new (to me, at least). Did the convenient API contribute to whatever design problems they had? No idea.

But if they had problems with durability then it really is all for naught. I didn't get beyond prototyping so I never saw whatever issues caused them to get so much ire.


Yeah, agreed. It does have a much better API than other JSON document stores I've tried. For the right use case, it would be fine.


On the contrary, it's great when you have relatively unformed data.

I've heard of a lot of logging (i.e., application logging) companies having great success with Mongo because they can store their clients' entire data sets without needing to know the schema ahead of time.


I can easily think of a few good reasons (it is actually a pretty good database for the right use-cases). Can you actually not find a good reason, or are you just continuing the exaggerated echo chamber around MongoDB's earlier failures for 'points'? Plus this has very little to do with the article, which is a very, very well done technical analysis of different solutions with very little opining.


I think you're baiting a rant. I count myself among the myriad developers who had to learn the hard way never to use Mongo.

My point would be that there are no "right" use-cases. I pointed it out because the article was talking about a specific thing; as it points out in the outlined scenario: "It’s clear now, that throughput for PostgreSQL and MongoDB now is almost the same before the spinlock performance degradation hits MongoDB."

So the article points out there is a significant performance degradation after a certain number of connections in MongoDB. Such things don't surprise me anymore; I'll add it to the heap of things wrong with MongoDB. That's why I was making the comment.

> "We’ve got so far that throughput of PostgreSQL and MongoDB is the same for read workloads. Is it surprising? Not at all - I’m going to show you, that under the hood all our databases use more or less the same data structures and the same approach to store documents."

I'd argue that there is no reason to use MongoDB, because there are no use-cases that other, more mature databases won't be equally or better suited for. MongoDB is horrid: you won't know all its deficiencies when you get started with it, but over time you will come to despise it. It is really far from being a mature database, and I would advise anyone against using it for any task. I'm not the only one who was burned using it and who adopted this attitude towards it.


I mean, the JSON support in PostgreSQL is pretty freaking awesome now. Beyond the sharding, what are your few good reasons?


One thing I'd like to see in postgres is support for a superset of json, such as timestamps or binary data inside a json document.

Replication and failover don't look so great in postgres either.


Whenever I store binary data as text, I just base64-encode it and convert it back on the client.
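
Roughly (the names are illustrative):

    using System;

    byte[] payload = { 0x89, 0x50, 0x4E, 0x47 };   // some binary blob

    // Store this string inside the JSON document...
    string encoded = Convert.ToBase64String(payload);

    // ...and convert back on the client when reading it out.
    byte[] decoded = Convert.FromBase64String(encoded);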


TL;DR: Use PostgreSQL.


Tweak your tests till PostgreSQL looks like the best choice. FTFY


After reading this comment I went ahead and read the article. I am having trouble finding a case where it seems like the tests are biased in Postgres’s favor but I don’t know very much about MongoDB so could be missing something.

Large numbers of clients — really, any more than there are cores — are notably not a Postgres strength.


It doesn't seem like you read it. It's spelled out as directly as it can be.

> I assume this one is most important questions for you now. Why PostgreSQL is underperforms so significantly? The answer is simple, we’re doing an unfair benchmark.

If PG doesn't perform well, it's unfair.


> Why PostgreSQL is underperforms so significantly? The answer is simple, we’re doing an unfair benchmark. As you may notice, we were indexing only an ID in a document for MySQL and MongoDB, and an entire document for PostgreSQL.

That's hardly biased after the change, and very much biased before it.


Three systems treated the same have different profiles. PG's profile was low enough that the benchmarking had to be altered to get the results they were expecting (or that the reader was expecting). If it was a technical implementation issue, why wasn't it done right the first time? Probably hubris, in an obvious case of tweaking the testing methodology until you get the result you wanted.


A note to plot-drawers everywhere. Please don't hide your actual data-points by drawing a bezier curve over them.


And PLEASE just label the damn axes. Why do people go to all the effort of drawing nice plots and writing a giant blog post, but leave off the one minute it would take to make the plots interpretable without reading giant slabs of text to reverse-engineer what is actually in them?

/rant



