This just lines up with what we've seen in the KV space over the last 5 years. M...

preetamjinka · on March 25, 2015

> But bolting SQL onto a KV store, even a really good one, isn't trivial to do.

Sure, but isn't that basically what everyone does? All of the relational databases I know use ordered key-value storage engines. FoundationDB is the same, except it's distributed.

The point is to use FDB as a foundation. I heard how one of the FDB founders replaced SQLite's B-tree storage engine with FoundationDB, which turned SQLite into a distributed SQL database. It's much more difficult to make something like that if you had to tackle the hard distributed systems problems.

(I interned for FoundationDB a couple of years ago.)

jhugg · on March 25, 2015

> Sure, but isn't that basically what everyone does? All of the relational databases I know use ordered key-value storage engines. FoundationDB is the same, except it's distributed.

> The point is to use FDB as a foundation. I heard how one of the FDB founders replaced SQLite's B-tree storage engine with FoundationDB, which turned SQLite into a distributed SQL database. It's much more difficult to make something like that if you had to tackle the hard distributed systems problems.

So this is a really interesting point and I can see why it makes sense if you’ve not built a SQL engine. So why is building SQL on top of a KV store the wrong call? What’s the difference between MySQL’s or Postgres’s or VoltDB’s “storage engine” and what FDB had built?

First, I’m not claiming that putting SQL on top of a KV store like FDB is impossible, just that you’re going to have to either compromise the purity of the KV engine substantially, or you’re going to get slow SQL for anything other than single row CRUD.

It starts with metadata. FDB-SQL stores metadata in the KV store itself. This is great from a distributed correctness point-of-view. It’s also great from a simplicity point-of-view. If I trust the underlying system to be safe and consistent, then my metadata is also safe and consistent. But now I need to do a ton of reads before I run my SQL to know how to run my SQL. Where’s my data, for example? Compare this to VoltDB which replicates metadata to all processing sites, and basically has a second state-machine to ensure that each processing site has the right metadata at the right time. Updating metadata is more expensive, but the fast-path of using the metadata is several orders of magnitude faster.

Then you get to locking. I believe FDB-KV uses per-key read-write-sets to manage concurrency. This choice makes their two-key transaction benchmarks look great, but it scales poorly as the number of keys per transaction grows. SQL basically begs users to write operations that read and write to lots of keys. To make scans fast (even common partial-index-scans), you’re going to need more granular locks, or you’re going to need to relax consistency (ACID, not CAP).

Now we get to storage efficiency. If I have a SQL relation with a primary key and other columns, do I split it so the primary key is in the “key” and the other columns are in the “value” in the KV store? Do I store the whole record in the “value” and loose some efficiency? What about relations with no primary key? Say I’ve got a table that facilitates a many-to-many relationship and it’s just a set of integer pairs. How do I store that efficiently in a KV store? And what about that pesky metadata? Does the key need to include the identifier of the SQL relation (table)? Of course it does.

Secondary indexes. Oh man. To do them well on top of a KV store, you’re going to need pretty strong consistency to ensure that index records point to the right base record, and that there are no base records without an index record. This rules out the aforementioned trick of relaxing consistency to get faster many-key transactions. It’s also a metadata problem; I’ve got metadata I need to read to understand how to use the secondary index. More than that, when I update a record in the base relation, I’m going to need to find all affected secondary indexes and make sure they reflect the new information. That might mean lots of reads to the system to query metadata to update the secondary indexes. And again, we have the efficiency problem where I have to put a bit of extra stuff in each “key” to identify the secondary index the KV-pair belongs to. So secondary indexes have locking/consistency, storage efficiency and metadata problems in this model.

Finally you get data transfer. This is probably not a broadly applicable problem, but I think in the FDB implementation, there is a lot of needless data transfer between KV cluster nodes and SQL processing nodes. Too much data needs to be collected and processed, rather than pushing down that processing to the data’s resting location. In FDB-SQL, if I just want one value from a large record, do I have to move the whole record over the network? Most SQL systems build processing DAGs for each SQL statement and can in-process stream between nodes, or have efficient temp tables to buffer intermediate results.

This falls out of making the SQL layer so separate. In fact, I think the team working on SQL was almost a completely separate team in Massachusetts than the KV team in Virginia. If the SQL layer worked, then I can see how awesome this would be. The “Layers” model just sounds so appealing. It’s just technically quite hard.

So yes, I agree that under most SQL-relational systems, there is a storage engine that smells much like a KV store. Still, that system is much less “pure” than what they were going for at FDB. Locking, metadata, secondary indexes, native understanding of relations, moving the processing to the data — all critical to do right to get reasonable performance.

If FDB had more time to continue on the SQL path, I imagine it would dictate much deeper hooks into the KV side of things. I’m sure with time it could get a lot better. There’s a lot of research addressing some of the tradeoffs I’ve mentioned above, especially in the distributed transaction coordination space. Still, it’s always going to be easier to build the storage engine for the kinds of operations you want from day one.