That’s just the example client code, not the sync code. It’s showing you how to demo that the sync works. You’re not expected to use that in production.
I mean, to be fair, even when using something like Postgres logical replication, there's no way to verify whether a particular change has replicated (minus xmin hackery).
If I were to write a tutorial on how to set up PG replication and "observe the replication", I think that would probably involve "write to master, sleep for a bit, then see the data is in the secondary."
The problem is that Postgres is a long-lived server, so the app can fire a commit at the DB and then exit. Here, the app needs to perform the sync, so you can't exit. I didn't see a way to confirm the sync was complete either, so when can you safely exit?
Couldn't you just query the replication log positions on the primary and standbys and calculate their delta?
I asked ChatGPT-4 about this and it suggested something like the following, which seems reasonable:
On the primary, SELECT * FROM pg_stat_replication; gives you one row per standby, and the write_lsn/flush_lsn/replay_lsn columns tell you how far that standby has gotten in the WAL. On the standby, SELECT pg_last_wal_receive_lsn(); tells you how much WAL it has received. Then you can compute the delta in bytes between any two of those positions with SELECT pg_wal_lsn_diff(lsn_a, lsn_b); or similar.
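Concretely, a rough sketch of checking the lag in bytes (assumes Postgres 10+ naming, psycopg2 installed, a role with monitoring privileges, and placeholder connection strings):

    # Sketch: replication lag in bytes. Assumes Postgres 10+ and psycopg2;
    # the DSNs below are placeholders.
    import psycopg2

    def primary_lag_bytes(primary_dsn):
        # On the primary: how far each standby's replay position trails the
        # primary's current WAL write position.
        with psycopg2.connect(primary_dsn) as conn, conn.cursor() as cur:
            cur.execute("""
                SELECT application_name,
                       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
                FROM pg_stat_replication
            """)
            return cur.fetchall()

    def standby_apply_lag_bytes(standby_dsn):
        # On the standby: WAL received but not yet replayed.
        with psycopg2.connect(standby_dsn) as conn, conn.cursor() as cur:
            cur.execute("""
                SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                                       pg_last_wal_replay_lsn())
            """)
            return cur.fetchone()[0]

    if __name__ == "__main__":
        print(primary_lag_bytes("host=primary.example dbname=app"))
        print(standby_apply_lag_bytes("host=standby.example dbname=app"))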
Sync notifications need to be async. That goes for `fsync()`/`fdatasync()` as well as for `COMMIT`. In the case of a database-using application you can report transaction non-failure (as in: it was well-formed and accepted) in a UI separately from transaction durability success (i.e., it really finished), and if you have replication to worry about, you can also report replication success separately.
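For the application side, a toy asyncio sketch of that split; nothing here is any particular database's API, the WAL write and replicate() call are just stand-ins:

    # Toy sketch of reporting "accepted" vs "durable" vs "replicated"
    # separately. os.write/os.fsync stand in for the engine's WAL handling,
    # and replicate() is whatever ships the change to replicas.
    import asyncio, os

    async def commit(txn_bytes, wal_fd, replicate):
        # Returns as soon as the transaction is accepted; the two futures
        # resolve later so the UI can report each stage on its own.
        loop = asyncio.get_running_loop()
        durable = loop.create_future()
        replicated = loop.create_future()

        async def background():
            os.write(wal_fd, txn_bytes)
            await asyncio.to_thread(os.fsync, wal_fd)   # durability point
            durable.set_result(True)
            await replicate(txn_bytes)                  # replication point
            replicated.set_result(True)

        asyncio.create_task(background())
        return durable, replicated

    # Caller: durable, replicated = await commit(...); update the UI as each
    # future resolves instead of blocking the whole request on the slowest step.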
Every blocking operation needs an async variant. Everything should be async-first.
Those silly Internet Protocol guys thought we'd use UDP multicast for such things. That was before everybody knew that "cluster" was just a code word for five 1-2 core VM instances scattered between continents, laptops, and cloud providers.
Litestream/LiteFS author here. It's been a while since I looked at Mycelial but IIRC it does eventually consistent replication using CRDTs. That's great for systems that need multi-master but there are consistency tradeoffs that you have to watch out for.
Litestream & LiteFS are both physical, async replication tools so they actually work at the page level. When a transaction commits, the changed pages are shipped off to a replication destination. By replaying those changes in the same order, you can rebuild the current state of the database (or any point-in-time state in between). Both of these systems require a single primary node at a time so multi-master doesn't work with them.
Litestream only replicates to "dumb storage" (e.g. S3, SFTP) whereas LiteFS replicates to other live nodes so you can have live replicas.
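In toy form, the replay side is roughly this shape (heavily simplified, none of the real formats, framing, or edge cases):

    # Toy illustration of ordered, page-level replay: apply (txid, page_no,
    # page_bytes) records in commit order to rebuild the database image.
    # Page numbering here is 0-based for simplicity.
    PAGE_SIZE = 4096

    def apply_change_log(db_image, changes):
        # db_image: bytearray; changes: iterable of (txid, page_no, page_bytes)
        for txid, page_no, page_bytes in sorted(changes):
            offset = page_no * PAGE_SIZE
            if len(db_image) < offset + PAGE_SIZE:
                db_image.extend(b"\x00" * (offset + PAGE_SIZE - len(db_image)))
            db_image[offset:offset + PAGE_SIZE] = page_bytes
        return db_image

    # Replaying the whole log into an empty bytearray() rebuilds the current
    # state; stopping at an earlier txid gives a point-in-time snapshot.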
cr-sqlite [0] is another option. It has the added benefit of being able to run in the browser (persisting to IndexedDB à la absurd-sql, though it uses a fork of wa-sqlite [1]). In theory it could also persist to the Origin Private File System via SQLite's official implementation, though switching doesn't seem to be strictly better [2]. A con is that it doesn't enforce foreign key constraints (though of course you can still do joins). cr-sqlite also has better documentation about which CRDTs it supports than Mycelite [3], which is something I brought up in a recent thread [4] (which might have triggered this post). I'm personally using cr-sqlite in an Anki clone I'm building [5] and have only positive things to say about cr-sqlite and its maintainer.
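For a feel of the API, here's roughly what using it from Python looks like. This is from memory, so double-check names like crsql_as_crr / crsql_changes / crsql_finalize against the cr-sqlite docs; the extension path is a placeholder and your Python's sqlite3 needs to be built with extension loading enabled:

    # Rough sketch of cr-sqlite from Python; function/vtab names are from
    # memory of the cr-sqlite docs, and the extension path is a placeholder.
    import sqlite3

    conn = sqlite3.connect("notes.db")
    conn.enable_load_extension(True)
    conn.load_extension("./crsqlite")  # prebuilt loadable extension

    conn.execute("CREATE TABLE IF NOT EXISTS notes (id PRIMARY KEY NOT NULL, body)")
    conn.execute("SELECT crsql_as_crr('notes')")  # make the table a replicated relation

    conn.execute("INSERT OR REPLACE INTO notes VALUES ('a1', 'hello')")

    # Changes to ship to (or merge from) a peer go through the crsql_changes vtab.
    for row in conn.execute("SELECT * FROM crsql_changes"):
        print(row)

    conn.execute("SELECT crsql_finalize()")
    conn.close()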
I’m unsure why this would be compared with Actual Budget or CRDTs in the README.
The issue that CRDTs are solving isn’t simply getting the bytes from one device to another, it’s ensuring that conflicts are merged in a sane way, which this project doesn’t clearly address. Simply syncing a DB across multiple devices won’t help if your app isn’t designed to handle data conflicts.
>The issue that CRDTs are solving isn’t simply getting the bytes from one device to another, it’s ensuring that conflicts are merged in a sane way
CRDTs merge data in a predictable way but not necessarily in a consistent or correct way according to the requirements of any particular application.
I find CRDTs interesting and they are certainly useful for some types of applications with relatively lax consistency rules. But I don't see how they help with the sort of consistency requirements that finance apps are bound to have.
Developing each new CRDT, including its merge strategies, is a fundamental research problem.
How likely is it that the right one already exists (or that you can combine a few of them) to express some arbitrary, application-defined consistency rule or merge strategy?
On top of that, any solution would have to agree with all your other requirements, such as your chosen degree of normalisation, your performance requirements, etc.
From my understanding, that was OP's point. This project isn't a research project; it's a one-size-fits-all coding project that purports to solve problems that you need a research project on a single domain in order to solve. It is not conceivable that this project could solve the problem that a CRDT is meant to solve, any more than you can parse HTML with regex. But the readme explicitly compares it with CRDTs and with a program that uses CRDTs to solve its problem.
The question OP is asking is, why is it comparing itself to CRDTs and Actual Budget? If Mycelite has special sauce, what is that special sauce? If it doesn't have special sauce, do they understand they're attempting to solve a problem which requires special sauce?
You hit the nail on the head. I don't mean to imply that CRDTs are perfect or usable for every use case, but they're a way of handling sync conflicts.
It is just unclear to me how this project does so, and if it doesn't, then why it even references CRDTs in the first place.
I read OP as implying that CRDTs solve the problem while they are questioning whether this project does. My response to that is to question whether CRDTs solve the problem in the general case (although they may well be helpful in a specific case).
We've been building iOS/macOS apps in local-first fashion for years.
It works differently: there's a persistence library called Core Data on each client (or, more recently, SwiftData). You can choose which backing store to use, most likely SQLite, but you could also use XML files to store your data.
And then Core Data simply makes sure everything is in sync. It's an Apple system library, which means your data will automatically sync even if the app isn't running.
This feels like a better approach than OP's Mycelite or the better-known cr-sqlite.
Any reason why syncing on the database level would be better? It feels like something the library should do, not the database.
CoreData is a footgun when it comes to threading; my approach is to use a wrapper lib (e.g. CoreStore) that takes the sharp edges off, or just use SQLite directly (or with a lib like GRDB).
To be slightly pedantic and clarify: it's not actually Core Data itself doing the syncing but CloudKit, more specifically NSPersistentCloudKitContainer, which came out a few years ago and basically makes adding CloudKit syncing to Core Data a one-liner.
This year at WWDC we got CKSyncEngine, which gives you the primitives to add CloudKit syncing to your own persistence layer.
CoreData doesn't work on any other platform, so you can only make Apple apps. If you wanted to have a web version, an Android version, or whatever else, you couldn't. So it's not a very good choice for most use cases.
We use Realm for this. Very similar to Core Data, just removing some of the obvious footguns, but it works across both iOS and Android, and the real-time syncing is excellent.
I haven't used it in a long time but we had a lot of issues with random crashes showing up in logs and the compile-times are pretty bad. Has this gotten better since the Mongo acquisition?
CoreData is fine for syncing a single user’s data between their devices. Something like cr-SQLite handles the more challenging case of syncing collaborative data between multiple users.
It's funny how local-first is now considered "a new type of application". Ten years from now there will be a blog post here presenting a revolutionary type of application - "online first", to reduce the need of all this annoying complexity, synchronisation, multiple clients, etc.
> Now we're starting to see renewed interest in personal computers.
'cept this time round, those slick black rectangles are so locked down you can hardly program on them. Not to mention that the people using them are mostly just looking at tiktok, twitter, instagram, snap, youtube.
The number of technical people today is vastly larger than it was back then. Their number just looks small because there's a new phenomenon of mainstream casual computing and those numbers dwarf anything else.
Personally I don't consider the black rectangles real computers. They're consumer content consumption devices, basically like televisions. They are to computers what a TV is to a production studio.
It's a weird way to phrase it, but "local-first" is different than "local-only", which was my gut reaction to the phrase. I think as a flip over the remote-then-sync model, it's apt, but could probably use some rebranding.
These days quite a lot of our problems are synchronisation problems (are you pulling mutable data regularly from a remote API and storing it? that's actually synchronisation), but despite that, synchronisation and how complicated it can be doesn't seem to get discussed. You don't hear debates about different approaches, and we don't seem to have a taxonomy of different techniques.
People seem to forget how hard it is. So let's say you're pushing changes from one system to another. How are you detecting changes? Timestamps, or a direct diff? Or checksums? Have you remembered to think about deletions? Can you even detect deletions? How do you keep track of which bits have successfully synced and which haven't? If something unusual goes wrong (the network goes down) halfway through a sync, can you recover from that? Incremental sync inevitably drifts over time due to bugs and downtime, so are you also building a full-scale wipe-and-refresh or compare-and-reconcile to accompany the incremental sync? And don't even get me started on two-way syncs and conflict resolution, and how hard that is to do without some sort of review-by-human component.
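To make that concrete, here's the kind of deliberately naive one-way sync people usually start with: last-modified timestamps, tombstones, and a persisted cursor. Table and column names are made up, the destination is assumed to mirror the source schema, and every shortcut in it maps to one of the questions above.

    # Deliberately naive one-way incremental sync: timestamps for change
    # detection, tombstone rows for deletions, and a persisted cursor so an
    # interrupted sync can resume. Clock skew, tombstone cleanup, and drift
    # (hence the periodic full reconcile) are left as exercises, which is
    # exactly the point.
    import sqlite3

    def sync_once(src, dst):
        dst.execute("""CREATE TABLE IF NOT EXISTS sync_state
                       (source TEXT PRIMARY KEY, last_seen REAL)""")
        row = dst.execute(
            "SELECT last_seen FROM sync_state WHERE source = 'items'").fetchone()
        last_seen = row[0] if row else 0.0

        # Source rows carry updated_at and a deleted flag; without tombstones,
        # deletions are invisible to a timestamp-based diff.
        changed = src.execute(
            "SELECT id, body, deleted, updated_at FROM items WHERE updated_at > ?",
            (last_seen,)).fetchall()

        with dst:  # one transaction: the batch and the cursor land together or not at all
            for id_, body, deleted, updated_at in changed:
                if deleted:
                    dst.execute("DELETE FROM items WHERE id = ?", (id_,))
                else:
                    dst.execute("INSERT OR REPLACE INTO items VALUES (?, ?, ?, ?)",
                                (id_, body, deleted, updated_at))
                last_seen = max(last_seen, updated_at)
            dst.execute("INSERT OR REPLACE INTO sync_state VALUES ('items', ?)",
                        (last_seen,))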
I think the reason it's not discussed that much is that a lot of software has given up on properly synchronizing changes and goes with a "handwaving eventual consistency" approach for a lot of things.
Furthermore, decisions about which db to use are rarely driven by the synchronization model. Except maybe sometimes for the question "eventual consistency: yes/no/only a bit".
And in academia (wrt. common use cases) we do have a bunch of well-analyzed ways to achieve synchronization; the main question often isn't one of academic research still needing to be done, but of which model, with which parameters and implementation details on top, to use for a specific kind of db. Which is mainly discussed by devs when writing new dbs, i.e. not an everyday thing. While for the many new sqlite sync approaches the answer is nearly always "handwaving eventual consistency", because nothing else would work on the edge anyway.
sure some systems under correct operation will never ever be fully in sync as long as they are used
but most systems also do not need to be ever fully in sync
it's good enough that, for a specific context, they will be in sync in not too much time if that context stops changing
and that is something they do provide
e.g. after updating a JSON blob stored under a specific id, that update will eventually be available in the not-too-distant future, and if no further changes to that document happen, then in the context of that document the system will be fully in sync. But because you have very many documents, there will always be some document which isn't yet in sync, and in turn the system as a whole will never be fully in sync. But in the end that doesn't really matter.
Also: no computer system will live that long, and if they aren't put in sync before they stop working they will literally never be fully in sync.
but a system as a whole never being fully in sync is a red herring argument which misses the point and focuses on an aspect which might sound like a problem but hardly ever is a problem at all
I’m just messing around with SQLite and serverless on Vercel and really like the simplicity of having just a db file you can read locally instead of setting up and connecting to a db instance somewhere else.
This wouldn’t work ofc as fs access in Vercel is limited, plus syncing the db would (probably) delay the network response. But
A few sites of mine are backed by SQLite databases replicated daily from another source. Right now I’m rsyncing them, but sending an ever-increasing number of gigabytes a day isn’t great for the long term. Anyone aware of some sort of checkpointing and incremental sync solution for SQLite? It doesn’t need to be real time, and shouldn’t be real time, really. This one isn’t what I’m looking for unfortunately; the “sign up for a free account” part is also pretty crazy.
Verneuil https://github.com/backtrace-labs/verneuil might work for you. Its replication client can use a persistent chunk cache (no fsync, so doesn't survive reboots), and there's a CLI utility to reconstitute a sqlite db file (`verneuilctl restore`). It also tries to do the right thing by writing to a temporary file before renaming it over the old db file.
On the write-side, you can load the verneuil VFS as a runtime extension. There's a similar gotcha with reboots, where the replication state is persisted but not fsync-ed, so you get a full refresh on reboots. The S3 storage is content addressed though, so you "only" pay for useless bandwidth and API calls, not storage.
The SQLite Session extension[1] is one option for shipping row-level change sets between databases. It may be more efficient than rsync and you can do it in batches.
Can you talk more about your use case? I'm the author of LiteFS which does real-time replication of databases. It doesn't sound like it's a fit for your use case but I'm always interested in hearing about what people are doing.
At this point, why not switch to using a real database like MariaDB or PostgreSQL? These days you don't even have to run it yourself; every cloud provider offers it as a service and can handle backups, patching, and replication for you.
I commented on another thread about cr-sqlite [0]. In addition to that, I believe Mycelial is VC funded, whereas Matt has Github sponsors. I hope there's a future where he teams with, say, Fly.io to make cr-sqlite sustainable.
I wish sqlite were performance-focused. Perf is pretty much always worse than a db server, even after tuning the pragmas (the way it locks tends to limit single-threaded perf and concurrency). The main benefit of sqlite is just its portability, since you can easily embed it. It's nice for slapping a db backend on a client-installed app, where trying to spin up a db server would be problematic.
There's probably an extension for permissions in sqlite but out of the box the db files are plaintext (you need an extension for that).
In my experience, SQLite has excellent performance if you use it correctly. It can easily do 10k to 500k inserts/second, depending on the workload.
SQLite only supports a single writer at a time, but does support concurrent reads. There is also an experimental branch that supports concurrent writes.
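Easy to check for yourself; the big difference is batching inserts in one transaction with WAL mode instead of autocommitting each row. Numbers obviously depend on hardware and schema, so this is just a quick sketch:

    # Quick insert-rate check: WAL mode plus one transaction for the whole
    # batch, instead of one autocommit (and one fsync) per row.
    import sqlite3, time

    conn = sqlite3.connect("bench.db")
    conn.execute("PRAGMA journal_mode=WAL")    # readers don't block the writer
    conn.execute("PRAGMA synchronous=NORMAL")  # fewer fsyncs; WAL stays consistent
    conn.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, v TEXT)")

    rows = [(i, f"value-{i}") for i in range(100_000)]
    start = time.perf_counter()
    with conn:  # single transaction for the whole batch
        conn.executemany("INSERT OR REPLACE INTO t VALUES (?, ?)", rows)
    elapsed = time.perf_counter() - start
    print(f"{len(rows) / elapsed:,.0f} inserts/sec")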
Any numbers to support that claim? SQLite performs pretty well when just reading. Only write-intensive workloads are usually slower than on pgsql and others.
The SQLite db files are definitely not plain text (that's only true in the cryptographer's sense of "plaintext", and even then not necessarily, because they might be encrypted internally using an extension).
SQLite is absolutely focused on performance and excels at it for its intended use cases. You can read about them here: https://www.sqlite.org/whentouse.html