Turning SQLite into a Distributed Database

monstrado · on Aug 21, 2022

This is exactly what the engineers behind FoundationDB (FDB) wanted when they open sourced. For those who don't know, FDB provides a transactional (and distributed) ordered key-value store with a somewhat simple but very powerful API.

Their vision was to build the hardest parts of building a database, such as transactions, fault-tolerance, high-availability, elastic scaling, etc. This would free users to build higher-level (Layers) APIs [1] / libraries [2] on top.

The beauty of these layers is that you can basically remove doubt about the correctness of data once it leaves the layer. FoundationDB is one of the most (if not the) most tested [3] databases out there. I used it for over 4 years in high write / read production environments and never once did we second guess our decision.

I could see this project renamed to simply "fdb-sqlite-layer"

[1] https://github.com/FoundationDB/fdb-document-layer

[2] https://github.com/FoundationDB/fdb-record-layer

[3] https://www.youtube.com/watch?v=OJb8A6h9jQQ

zasdffaa · on Aug 21, 2022

> Their vision was to build the hardest parts of building a database, such as transactions, fault-tolerance, high-availability, elastic scaling, etc. This would free users to build higher-level (Layers) APIs [1] / libraries [2] on top.

That is very interesting and simple and valuable insight that seems to be missing from the wiki page. But also from the wiki page <https://en.wikipedia.org/wiki/FoundationDB>, this:

--

The design of FoundationDB results in several limitations:

Long transactions- FoundationDB does not support transactions running over five seconds.

Large transactions - Transaction size cannot exceed 10 MB of total written keys and values.

Large keys and values - Keys cannot exceed 10 kB in size. Values cannot exceed 100 kB in size.

--

Those (unless worked around) would be absolute blockers to several systems I've worked on.

monstrado · on Aug 21, 2022

This project (mvSQLite) appears to have found a way around the 5s transaction limit as well as the size, so that's really promising. That being said, I believe the new RedWood storage engine in FDB 7.0+ is making inroads in eliminating some of these limitations, and this project should also benefit from that new storage engine...(prefix compression is a big one).

jiggawatts · on Aug 21, 2022

A simple workaround is to store “bulk data” in an external system like blob storage and reference that from the DB.

chii · on Aug 22, 2022

but now transactional guarantees only extend to the id stored in the DB, and not on the external storage.

Therefore, it's possible that the id is invalid (for the external storage) when referenced in the future. I think doing so only adds complexity as system grows.

It would be better to chunk your blob data to fit the DB, imho. It beats introducing external blob storage in the long run.

jiggawatts · on Aug 22, 2022

> but now transactional guarantees only extend to the id stored in the DB, and not on the external storage.

Depends! If the ID is a cryptographic hash, then as long as the blob is uploaded first, then the DB can't be inconsistent with the blob[1].

A Merkle Tree also allows "updates" by chunking the data into some convenient size, say 64 MB, and then building a new tree for each update and sticking that into the database.

[1] With the usual caveats that nobody is manually mucking about with the blob store, that it hasn't "lost" any blobs due to corruption, etc, etc...

CRConrad · on Aug 22, 2022

> With the usual caveats that nobody is manually mucking about with the blob store, that it hasn't "lost" any blobs due to corruption, etc, etc...

Yeah, with those caveats. But how do you make sure they apply? If someone does manually muck about with the blob store, or it does lose blobs due to corruption, then your transaction is retroactively "un-atomicized" with no trace thereof in the actual DB.

zasdffaa · on Aug 22, 2022

If corruption happens then any guarantees by the hardware are voided, and the software guarantees (of durability) which are built o the hardware guarantees are equally voided. So corruption -> dead in the water.

But otherwise I agree.

chii · on Aug 22, 2022

But if you upload to the blob store first, then add to the transaction (in your db insert) with the id (hash or not), what happens if the db transaction fails? You now have to work out a way to delete off the blob external store. Or change your application so that that it doesn't matter if it's left on the blob store ('cept for money).

victor106 · on Aug 21, 2022

How do you get started with FDB? I found it very powerful but couldn’t find good set of instructions on how to setup and scale.

monstrado · on Aug 21, 2022

Running it locally is as easy as downloading and installing. Scaling FDB is a bit more of a challenge partially due to their process-per-core design decision, which coincidently helps make FDB as bullet proof as it is.

foobiekr · on Aug 21, 2022

Where I previously was runs it in production. it's not hard to scale but at some point you will need to have multiple clusters (maxes out in practice at like 50 instances).

It's basically trouble free unless you run below 10% free space on any instance, where things go bad.

monstrado · on Aug 21, 2022

Not sure if I hit those limits, we were at around 100 nodes and over 170-180 processes. The biggest thing we recognized was tuning the number of recruited proxies and other stateless roles. We were doing up to around 400k tps once we tuned those.

tehbeard · on Aug 21, 2022

How bad are we talking? Performance degradation? Or losing transactions and data?

losfair · on Aug 21, 2022

FDB refuses to process writes (degraded write availability) when usable disk space goes below 10%. Nothing will be lost though.

foobiekr · on Aug 27, 2022

The problem here is when you try and recover the cluster by adding nodes - since FDB manages itself using transactions, this becomes very, very slow and painful if you've allowed multiple nodes to get into this state (which, because of balancing, is kind of how you get there).

Basically, fdb is great as long as you avoid this situation. If you do, woe unto you trying to being the cluster back online by adding nodes and hoping for rebalancing to fix things. It will, it is just very, very slow. I don't know if that's true in the current version.

tehbeard · on Aug 22, 2022

Good to know. and it seems to be tuneable with `knob_min_available_space_ratio`[0], as 10% free space on a 4TB drive would be 400GB.... Not exactly hurting for space there.

[0] https://forums.foundationdb.org/t/brand-new-macos-installati...

alberth · on Aug 21, 2022

Don’t forget BedrockDB (built on SQLite) that’s used in production at Expensify.

How it scales as well.

https://bedrockdb.com/

https://blog.expensify.com/2018/01/08/scaling-sqlite-to-4m-q...

glaucon · on Aug 22, 2022

Apologies for minor derail but that blog post explains how they use /etc/hosts in place of DNS and why. These are people who aren't afraid to take the less conventional path (or at least weren't in 2018). I like it, makes me want to work for them !

juancampa · on Aug 22, 2022

Thanks for pointing this out. I've never heard of that approach and now I'm curious to know more. My first thought is that it seems like a very simple way to implement service discovery

edf13 · on Aug 21, 2022

This seems like and incredibly expensive way of achieving their query goals…

> basic specs: > 1TB of DDR4 RAM > 3TB of NVME SSD storage > 192 physical 2.7GHz cores (384 with hyperthreading)

Not to mention the risk of hardware failure.

samatman · on Aug 21, 2022

> To be clear, the above specs would be pointless for most databases, as almost nothing scales to handle this kind of hardware well — and almost nobody tries.

That strikes me as the more interesting takeaway sentence.

It's not that much RAM if you think of it as 24 8-core machines.

Macha · on Aug 21, 2022

For a database? At expensify scale?

It's still feels like a lot.

robertlagrant · on Aug 21, 2022

What is expensify scale?

Macha · on Aug 21, 2022

Ten million users [1] averaging maybe averaging 40 or so interactions per year (filling out an expense report isn't a common task for a lot of those users)? Napkin math of 1.2 qps. Even if you 10x that for backing workflows, and double it because users are more active than expencted, that's still only 30qps.

[1]: https://www.sec.gov/Archives/edgar/data/1476840/000162828021...

alberth · on Aug 21, 2022

Given the time, effort and hard dollar expense they put into development of BedrockDB - I have to imagine they do more than 1 QPS.

Why else would they benchmark to show they can achieve 4,000,000 QPS if all they need is 1 QPS?

robertlagrant · on Aug 22, 2022

While I think the figures might be a little conservative (or might work for the mean, but not the peak) it is a little odd to imagine why an expenses app would need a database that syncronises via a private blockchain to track expenses. It would be interesting to understand the rationale.

tiger3 · on Aug 22, 2022

Sunk cost fallacy?

yencabulator · on Aug 21, 2022

Relevant reading about FoundationDB building a SQL database on top of a distributed key-value store: https://www.voltactivedata.com/blog/2015/04/foundationdbs-le... (That one replaced SQLite's btree, this one puts pages of the btree as values in the key-value store.)

Another approach using FUSE, making arbitrary SQLite-using applications leader-replica style distributed for HA: https://github.com/superfly/litefs (see also https://litestream.io/ for WAL-streaming backups, that's the foundation of this)

phamilton · on Aug 21, 2022

If I understand this correctly, it's similar in design to AWS Aurora or GCP AlloyDB.The underlying storage provides the distributed primitives and the DB itself just reads and writes to it.

Like Aurora, some tweaks to the engine were required, but the core query engine is largely intact.

Has anyone seen postgresql on FoundationDB? Is there anything unique about SQLite that makes it better suited for this approach? One thing that comes to mind is how they were able to do block level locking instead of full db locking. That took some tweaking but probably significantly less than postgresql might require.

samsquire · on Aug 21, 2022

This is interesting..I would like to learn more how they implemented MVCC.

I wrote single machine MVCC in Java and I'm curious if there are other ways of implementing it. One way is event sourcing.

I use an integer to store the latest commit version and I only allow transactions to see versions less than the transaction's timestamp. This is the multiversion part.

The concurrency control part is enforced by checking if the read timestamp of the key is less than the reading transaction timestamp, if so someone got there before us and we abort and restart.

I am thinking how to build the distributed part.

I need a timestamp server the same way Google's Spanner needs TrueTime for monotonic timestamps but also some way of broadcasting read timestamps to detect conflicts between nodes. So I'm thinking of broadcasting timestamp events and using that to detect transactions that have dangerous dependencies.

losfair · on Aug 21, 2022

FoundationDB handles the hard part! It provides monotonic "versionstamps", externally consistent transactions, along with other useful features. I recommend FDB's architecture docs: https://apple.github.io/foundationdb/architecture.html

hrgiger · on Aug 21, 2022

There are plenty of leader election/replication libraries, shouldnt be problem if you pick one and give a shot if you are not confident about your self implementation or just benchmark and see how it goes.

hardwaresofton · on Aug 21, 2022

So I just fell down the rabbit hole of figuring out how to use SQLite with Ceph (turns out a thing called libcephsqlite[0][1] exists) -- awesome to see this new take on distributed SQLite.

The caveats for dqlite and rqlite always felt kind of awkward/risky to me -- in stark contrast to SQLite which is so stable/"built in" that you don't think about it's failure modes. Having to worry about what exactly I ran (ex. RANDOM()) was just a non-starter (IIRC rqlite has this problem but not dqlite? or the other way around -- one replicates at statement level the other at WAL level).

That said though, the biggest sticking point with all this SQLite goodness is how to make sure that certain libraries (any popular extension -- vsv, spatialite, libcephsqlite) were loaded for any application using SQLite -- there seem to be only a few options:

- calling load_extension[2] from code (this is somewhat frowned upon, but maybe it's fine)

- LD_PRELOAD (mvsqlite does this)

- Building your own SQLite and swapping out shared libs (mvqslite also does this, because statically compiled sqlite is a nuisance)

- Trapping/catching calls to dlopen (also basically requires LD_PRELOAD, but I guess you could go custom kernel or whatever)

This is probably the one big wart of SQLite -- it's a bit difficult to pull in new interesting extensions.

I also found this hack[3] which looks quite interesting for building something more general/reusable...

[EDIT] - Also while I'm here, I think FDB is probably one of the most under-rated massive-scale NoSQL databases right now. It gets nearly no press (to be fair because it went closed then open again), but it's casually a massive force behind Apple's services at scale.

[0]: https://docs.ceph.com/en/latest/rados/api/libcephsqlite/

[1]: https://github.com/rook/rook/issues/10689

[2]: https://www.sqlite.org/lang_corefunc.html#load_extension

[3]: https://github.com/cventers/sqlite3-preload

mike_hock · on Aug 21, 2022

Wait, so the one method that isn't a massive hack is the one that's "frowned upon"?

hardwaresofton · on Aug 21, 2022

Ah sorry I didn’t state that clearly — calling the function From SQL (rather than C/FFI) is frowned upon I think

medv · on Aug 21, 2022

Also https://github.com/rqlite/rqlite

It’s supper cool as it does change sqlite.

conradev · on Aug 21, 2022

I really like the idea of using FUSE as SQLite backend, rather than injecting into the process

I love the use case of querying SQLite from a CDN with range requests, because it allows for real “serverless” querying. For example, this: https://github.com/psanford/sqlite3vfshttp

losfair · on Aug 21, 2022

> I love the use case of querying SQLite from a CDN with range requests

Author here. Actually I have a similar idea with mvSQLite. Provide a client-side-queryable API, but read-write instead of read-only. Security can be implemented with a "provenance"-style mechanism: the client proves they reached a page following a valid/allowed path, by presenting the path (along with necessary signatures) to the server. That way we can have "serverless" read-write transactions with table-level security.

conradev · on Aug 21, 2022

Interesting! I had not thought about access controls when all of the pages are mixed together, but that makes sense.

Have you thought about encryption? Is it possible to do client-side symmetric encryption?

nik736 · on Aug 21, 2022

What makes this special compared to rqlite or dqlite? Edit: https://github.com/losfair/mvsqlite/wiki/Comparison-with-dql...

otoolep · on Aug 21, 2022

rqlite author here.

There are a couple of statements on that page I disagree with.

"mvsqlite is a distributed database, while dqlite and rqlite are replicated databases"

I disagree with this statement, and consider its definition of "distributed system" to be incorrect. rqlite[1] is a distributed database. A "distributed system" is simply a system that splits a problem over multiple machines, solving it in a way that is better, more efficient, possible etc than a single machine. rqlite uses distribution to provide fault-tolerance and high-availability. It uses distributed systems technology i.e. Raft, to make rqlite appear up-and-running, even in the face of node failures. That it replicates a full copy of the SQLite database to every node is correct, but that doesn't mean it's not a distributed system. Is Consul a distributed key-value store? etcd? By the definition quoted on that page they are not, but no one would actually agree with that.

rqlite is not just about replicating a SQLite database. I understand what the page is trying to say, but the point is that rqlite is distributed, just for fault tolerance and high-availability.[2]

"(+): mvsqlite runs on a production-grade distributed key-value store, FoundationDB, instead of implementing its own consensus subsystem."

rqlite doesn't implement its own consensus system either. It uses the same Raft consensus code that powers Hashicorp Consul. I don't see how this is any different, in principle, than using Foundation DB's consensus system.

[1] https://github.com/rqlite/rqlite

[2] https://github.com/rqlite/rqlite/blob/master/DOC/FAQ.md#rqli...

losfair · on Aug 21, 2022

Hi otoolep, thanks for the corrections! Actually I've just made a few changes to the wiki page before this comment :)

rqlite's readme seems to indicate that it uses Consul only for service discovery, and runs Raft internally?

I think the most important difference here is rqlite runs its "data plane" on a single consensus group/state machine while mvsqlite relies on FDB's distributed transaction system (well at the bottom there's a Paxos but it is only used for metadata coordination). Both approaches have advantages and disadvantages though.

otoolep · on Aug 21, 2022

Correct. What I meant by the reference to Consul is that rqlite uses the Raft implementation at its center and repurposes for use by rqlite.

I can make that clearer.

otoolep · on Aug 21, 2022

And good luck with your project, welcome to the distributed SQLite space. :-)

ComodoHacker · on Aug 21, 2022

Looks like in their view "fully distributed" == "write scalability", which is, of course, very limited view.

otoolep · on Aug 21, 2022

Exactly. Providing write-scalability is big deal, don't get me wrong, and this project looks interesting. The biggest misconception folks have about rqlite is that it increases SQLite write performance, when it's about providing fault-tolerance and reliability.

So projects like mvsqlite are solving a real need. I just disagree with some of its definitions.

EGreg · on Aug 21, 2022

But is it Byzantine Fault Tolerant? From what I understand, RAFT is not. Paxos-based consensus can be made to be.

im_down_w_otp · on Aug 21, 2022

That's nobody's business but the Turks'.

hardwaresofton · on Aug 21, 2022

The approach is different -- both rqlite and dqlite sit "on top" of SQLite and are much more like replication and coordination layers.

IIRC rqlite replicates commands, dqlite replicates WAL frames, mvsqlite intercepts file system calls because it's a VFS implementation [0].

[0]: https://www.sqlite.org/vfs.html

formerly_proven · on Aug 21, 2022

Speaking of "turning SQLite into things it wasn't really meant for", does anyone know of a compressing time-series layer for SQLite?

stereosteve · on Aug 21, 2022

While not sqlite… duckdb is inspired by the design of sqlite

https://duckdb.org/

baq · on Aug 21, 2022

Sounds good in theory and if it really works as a drop in replacement it’s amazing tech, but numbers, please! I can make a beefy Postgres server handle 5-10-20k tps relatively easily, what can I expect from this?

jrockway · on Aug 21, 2022

I don't think distributed databases are aiming primarily at increasing write throughput over a single node. The question is, how many transactions can you write per second from your California datacenter to your New York postgres leader when someone typos the BGP rule controlling that database's IP address? Probably zero. That's the performance number these databases are trying to increase.

Many people can say, "wait, that doesn't matter to me at all, that happens at most once a year and all of my customers are also offline when it happens". Indeed, that's why people are doing just fine with postgres. But, some things are a little more mission critical than average, and so these technologies exist for them.

One thing that does tip the scale towards these distributed databases are operational concerns. "Apply security patch" or "migrate to new major version" look a lot to distributed systems like "tornado took out the data center" or "hard disk turned into many disconnected chunks of steel". So while you might not be super concerned about disasters, you might still be interested in keeping your compute nodes disposable or in doing regular maintenance without downtime. I don't think the operational balance is quite there yet, but it's definitely worth looking into every few months to see what the state of the art is.

baq · on Aug 21, 2022

That would be my point if I was about to make a more elaborate post - I don't expect this to run 20ktps on a geographically distributed cluster, but if it can do 100tps with synchronous replication between east and west coast it opens a different set of possibilities than 5tps. If it can do 20ktps... it makes my life a whole lot easier, which is quite fine by me, too ;)

azurelake · on Aug 22, 2022

You can already do this with MySQL/Postgres/etc., and there's no special magic in any of these newer tools that allow you to avoid the speed of light. The reason people don't typically run WAN synchronous setups is exactly because of that (and network reliability).

The only way to a "free lunch" is via CRDTs.

edit: This is in no way a knock on FDB by the way. I think it's a really interesting piece of technology and I want to explore it more. Just that it doesn't have a way to do consensus faster than the speed of light :)

losfair · on Aug 21, 2022

I'm working on the benchmarks. There shouldn't be surprising results with read scalability, as it is basically guaranteed by FoundationDB. For writes it is indeed more complicated and a benchmark would help here.

tmikaeld · on Aug 21, 2022

Please post the results on HN when you have them, would love to read it

endisneigh · on Aug 21, 2022

I’m interested in how FoundationDB can make anything consistent.

For example: you’re using Postgres. You send the request to FDB, it will ensure all Postgres transactions are consistent or tell Postgres to abort transactions.

lifeisstillgood · on Aug 21, 2022

I am fascinated by how the initial authors of all these packages went from "that would neat if I had 3 months free" to "yeah, doing it now"

EGreg · on Aug 21, 2022

How does this compare to, say, CockroachDB?

Joel_Mckay · on Aug 21, 2022

I suspect because most of CockroachDB is written in Go, that it performs much better. Of course, this assumes no sequential keys or auto-increment indices (global state interdependence chokes distributed dbs to a crawl).

didip · on Aug 21, 2022

I am surprised that SQLite can be plugged into FDB just like that.

abujazar · on Aug 21, 2022

But why?

solarkraft · on Aug 21, 2022

I've seen a lot of Sqlite hype here in the past weeks and months.

Please excuse my ignorance, but what do all these Sqlite-but-make-it-X solutions offer over a simple, more established solution like (nobody ever got fired for choosing) Postgres?

mrkurt · on Aug 21, 2022

You need to consider some of what sqlite is good at. Querying local sqlite is very fast. Much lower latency than querying postgres. It's also very reliable – a local file will always work better than a network service.

I'm pretty into Litestream/LiteFS. Here's what I'm after:

1. Operational simplicity. Fly.io devs run small app servers in a bunch of places. They're usually read heavy. Running network database servers gets very complicated, very fast. You need a DB node in each place your app server lives.

2. Graceful failure. When you scatter app servers around the world, internet weather causes problems for an app. I want my DB to be ok when that happens. Reads should continue to work, if they can. And writes should fail in a way that makes it obvious what's happening.

3. Good for caching. Most fullstack apps use Postgres and then layer in Redis/Memcached for caching. This is yet another moving part. sqlite has amazing performance for cache workloads.

These all make an embedded DB interesting. If Postgres fails, your app server needs to know that Postgres failed and then also fail. Same if you add Redis. If your embedded DB fails, it's obvious to the app server that something is awry.

Another thing I'm after is drop in usage without touching code. We run a lot of peoples' apps and have limited ability to have them add libraries or write new code. Almost all of their frameworks speak sqlite, though.

Adding clustering to sqlite is not perfect. There are still networks to deal with and things will still break. All it's doing is shifting complexity and giving us different levers to use to keep things reliable.

losfair · on Aug 21, 2022

Hi mrkurt!

Litestream/LiteFS are amazing projects. The FUSE-based approach is interesting (I'm implementing something similar in mvSQLite, thanks for the idea!)

> Graceful failure

mvSQLite is designed to continue to operate under degraded network (there is a fault-injection test specifically for checking this property: https://github.com/losfair/mvsqlite/blob/1dd1a80d2ff7263b07a...). Network errors and service unavailability are handled with idempotent retries and not exposed to the application.

> Good for caching

mvSQLite caches pages read and written, and does differential cache invalidation (only remotely modified pages are invalidated in the local page cache). The local cache is just a regular KV store with invalidation strategies, and can be moved onto the disk. So it essentially becomes a consistent local database snapshot.

mrkurt · on Aug 21, 2022

mvSQLite checks a lot of my boxes. The FoundationDB requirement makes it suboptimal for tiny little node/rails/laravel apps, though. If we were in the business of hosting FoundationDB it'd be perfect.

CoolCold · on Aug 21, 2022

I've got impression this approach is heavily focused towards tiny DBs with small amount of concurrent writes and single digit teams behind the project.

For something larger, finding out writes to DB fail is not a big deal - apps have logs after all.

Another thing captured my attention is DB sizing/fault tolerance

1.keeping average sized DB of 100-300 GB locally, without much buffer pool/vfs cache with higher chances will slowdown the reads and joins. Something bigger like couple of TB should be even harder to do.

2. Syncing multiple concurrent writes can be even longer or on par - guessing here.

3. Both Redis and Postgres (and MySQL) can be (should be) kept behind LB. If node fails, clustering + LB will handle it. LBs of course can be present per location and provide local reads and forwarding writes to remote.

Regarding LB, probably simplest would be have different tcp ports for readwrite/readonly types of load. Reads can be split among multiple replicas be needed.

I have not used it for anything big, but from my experience small/average load (80k rps to Redis with 3Gbit of bandwidth) LB was not an issue. 500k rps may change the situation of course, not a silver bullet.

arinlen · on Aug 21, 2022

> Please excuse my ignorance, but what do all these Sqlite-but-make-it-X solutions offer over a simple (...)

I can't speak for all, but keep in mind that SQLite does not require a server or sends requests over a network to get to the data. This means SQLite is a far simpler and cheaper deployment, which is totally acceptable if you don't have high reliability and horizontal scalability in mind. Sometimes it even outperforms production RDBMS.

Bolting on distributed access to SQLite adds horizontal scalability for (in the very least) cases where eventual consistency is more than enough to meet your requirements.

otoolep · on Aug 21, 2022

Can only speak for rqlite: https://github.com/rqlite/rqlite/blob/master/DOC/FAQ.md#why-...