You are exactly right; however, Vertica already handles data locality, columnar storage, and compression for us. Vertica is so good at its job that we are CPU bound on most queries, which is why strategies like reducing locking and using shorter data types make a real difference.
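To make the "shorter data types" point concrete, here is a minimal Python sketch (not Vertica SQL; the column data is invented for illustration) showing how much less memory a column of small integers occupies when stored in a 1-byte type instead of an 8-byte one. In a CPU-bound columnar engine, fewer bytes per value means fewer bytes to decompress and scan per row:

```python
from array import array

# Simulated column: one million small integers (all fit in a signed byte).
values = list(range(100)) * 10_000

wide = array('q', values)    # 'q' = 8-byte signed integer per element
narrow = array('b', values)  # 'b' = 1-byte signed integer per element

print(wide.itemsize, narrow.itemsize)  # 8 1
print(len(wide) * wide.itemsize)       # 8,000,000 bytes
print(len(narrow) * narrow.itemsize)   # 1,000,000 bytes
```

Same logical column, one eighth the bytes to move through the CPU.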
There is always a balance here between CPU and IO. For a long time, databases and big data platforms were pretty terrible at IO. However, as the computer engineering community has had time to work on these problems, we have gotten considerably better at storing data in sorted, compressed columnar formats and at exploiting data locality via segmentation and partitioning. As a result, most well-constructed big data products are CPU bound at this point. For instance, check out the NSDI '15 paper on Spark performance that found it was CPU bound. Vertica is also generally CPU bound.
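A toy Python sketch of why sorted columnar storage compresses so well (run-length encoding is one of the simpler columnar codecs; the column values here are made up). Once a sorted column collapses into a handful of runs, the IO is tiny and the remaining per-value work is CPU:

```python
def rle_encode(column):
    """Run-length encode a column into [(value, run_length), ...]."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

# A sorted column of 10,000 country codes collapses to three runs.
sorted_col = ["us"] * 5000 + ["uk"] * 3000 + ["de"] * 2000
runs = rle_encode(sorted_col)
print(runs)  # [('us', 5000), ('uk', 3000), ('de', 2000)]
```

The same column in random order would produce thousands of runs, which is why sorting (and segmentation that keeps like values together) matters as much as the codec.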
After skimming the paper, I'm fairly confident it's not the same at all. We only handled the theoretical side of a scenario with multi-TB hard drives across multiple machines. Any efficient algorithm would work in a scanning manner and never seek backwards beyond what could be kept in RAM. We did simulate this, and the result was quite clear: IO matters.
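The "scanning manner" constraint can be illustrated with a short Python sketch (my example, not from the simulation described above): a single forward pass over data too large for RAM, using O(1) state and no backward seeks, which keeps the disk streaming sequentially:

```python
def streaming_max(chunks):
    """One forward pass over chunked data: O(1) memory, no backward seeks."""
    best = None
    for chunk in chunks:        # each chunk is read sequentially from disk
        for value in chunk:
            if best is None or value > best:
                best = value
    return best

# Simulate multi-TB data arriving as sequential chunks.
print(streaming_max([[3, 1], [9, 2], [5]]))  # 9
```

Under this access pattern, sequential read bandwidth (not seek time) becomes the bottleneck, which is exactly where IO can still dominate.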
From the paper, the following three quotes highlight exactly why they were CPU bound:
> We found that if we instead ran queries on uncompressed data, most queries became I/O bound
> is an artifact of the decision to write Spark in Scala, which is based on Java: after being read from disk, data must be deserialized from a byte buffer to a Java object
> for some queries, as much as half of the CPU time is spent deserializing and decompressing data
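The tradeoff those quotes describe can be reproduced in miniature with any general-purpose codec: compression shrinks the bytes read from disk, but restoring them is pure CPU work. A hedged Python sketch using zlib (the Spark workloads in the paper used their own codecs and serializers; the repetitive data below is invented):

```python
import zlib

# A highly repetitive "column", as sorted columnar storage tends to produce.
raw = b"2015-01-01,US,200\n" * 100_000

compressed = zlib.compress(raw, level=6)
print(len(raw), len(compressed))  # far fewer bytes to read off disk...

# ...but getting the rows back burns CPU cycles on decompression
# (and, in Spark's case, on deserializing bytes into Java objects).
restored = zlib.decompress(compressed)
assert restored == raw
```

Turn the compression off and the query reads more bytes but does less CPU work per byte, which is exactly the "uncompressed data became I/O bound" observation.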
Go to 10gen's site. Watch some of the videos and see the huge volume of data (in TPS or in TBs) that people are working with. Go on the Google group and see what problems people have. Don't take an anonymous pastebin post as gospel.
We easily support tens of millions of writes and reads against Mongo per hour on a very small (single-digit) number of shards in the cloud (i.e. crappy disk I/O). While that is around an order of magnitude less than 30k a second, I would be surprised if we couldn't scale mostly linearly by adding shards.
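Back-of-the-envelope arithmetic on those rates (the 30k/s figure is from the claim being replied to; the "tens of millions" midpoint below is my assumption):

```python
# 30k operations per second, sustained for an hour:
per_hour_at_30k = 30_000 * 60 * 60
print(per_hour_at_30k)  # 108,000,000

# "Tens of millions per hour" -- assume ~30M as an illustrative midpoint.
tens_of_millions = 30_000_000
print(per_hour_at_30k / tens_of_millions)  # 3.6
```

So the gap is a few-fold to roughly 10x depending on where in the "tens of millions" range you land, i.e. within about an order of magnitude.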
P.S. If your workload is key-value, then you should use a KV store.
It seems a bit disingenuous to post slides now reviewing a year-old version of MongoDB that is almost three major revs behind.
These days auto-sharding works, and works well. There is a learning curve to optimizing it (e.g. picking a good shard key, analogous to picking a good _id field), but that is to be expected.
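A sketch of why shard-key choice matters, in Python with a toy hash (this stands in for MongoDB's hashed shard keys; it is not Mongo's actual algorithm). Monotonically increasing _ids range-shard badly because every new write lands in the "last" chunk, while hashing the same keys spreads writes across all shards:

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    # Toy stand-in for a hashed shard key (not MongoDB's real hashing).
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# 10,000 monotonically increasing ids, like timestamp-prefixed ObjectIds.
ids = [f"doc-{i:08d}" for i in range(10_000)]
load = Counter(shard_for(_id) for _id in ids)
print(load)  # roughly 2,500 writes per shard
```

Range-sharding those same ids would send all 10,000 inserts to one shard, which is the hotspot the "learning curve" is mostly about avoiding.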
Replica sets work exactly as advertised in my experience. Automatic failover and no need to wake up at 3am is a huge win.
Finally, single-server durability directly addresses the durability concerns that many have raised.
The first two points are the reason my company and so many others are switching, even if that means learning to pick good ids/shard keys and shortening our field names by hand. I can see why MongoDB might not have been a good choice when those features were unavailable, but the fact of the matter is that they are available today.
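For anyone puzzled by the field-name shortening: MongoDB stores field names inside every document, so long names are paid for on every record. A rough Python/JSON illustration (JSON rather than BSON, so the exact byte counts differ, and the field names here are invented):

```python
import json

verbose = {"customer_email_address": "a@b.co", "purchase_timestamp": 1700000000}
# Hand-shortened keys, with the long->short mapping kept in application code.
short = {"e": "a@b.co", "ts": 1700000000}

print(len(json.dumps(verbose)), len(json.dumps(short)))
```

Multiply the per-document savings by hundreds of millions of documents and the hand-shortening stops looking so silly.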
https://www.slideshare.net/AmazonWebServices/cmp402-amazon-e...