Databases we have today aren't designed for the kind of consumer applications we use today. They are designed for banks, Amazons and their smaller counterparts - where a central entity (the business or corp) needs full control over the data.
The most popular consumer apps (FB, LinkedIn, blogs, photo sharing, goal trackers, slack, basecamp etc) shouldn't be centralized in this fashion. Data(bases) should be owned by users, and synced with friends and colleagues. So instead of a single centralized database, we need a million tiny little databases which are synced with peers.
Tiny databases fit well into today's hardware capabilities landscape. We have plenty of fast storage, and powerful CPUs on computers and phones. It's time to stop building as if the server is the only place where queries can run efficiently. Truly serverless, that is. This also improves data security in general; there is no central place to hack.
Happy to talk about this in detail. Please feel free to email me.
>Databases we have today aren't designed for the kind of consumer applications we use today.
Yes they are? Relational databases are very good for consumer applications with relational data. NoSQL databases are good for everything else.
>Data(bases) should be owned by users, and synced with friends and colleagues.
Why? You use words like should - why SHOULD it be that way? Users are able to create / buy / rent infrastructure to manage their own data - and yet they don't ... why?
>Shouldn't be centralized in this fashion. Data(bases) should be owned by users, and synced with friends and colleagues. So instead of a single centralized database, we need a million tiny little databases which are synced with peers.
Nice in theory (maybe). The problem is that centralization provides enormous quality and performance benefits. Distributed systems are resilient, but slow, and require a lot of care to maintain consistency and integrity of the network. Take email as an example. It took us 20 years to get email spam somewhat under control, and a major part of that was creating a centralized infrastructure (from using centralized email providers to third-party whitelists and blacklists). It's gotten to the point where hosting your own mail server is a huge hassle and pretty much impossible for regular people. I spent about a week trying to figure out why our Office365-originating mail wasn't being received by one of our customers. It turned out, their mail provider matched a particular phrase in our automatically-inserted disclaimer and completely rejected the mail.
Decentralized systems also make controlling your data all but impossible. If I want to remove my data from Facebook, that's the only entity that I need to deal with and it's relatively easy for regulators to set sound policies in place. In your distributed system, once your data leaves your node, you've lost all control of it forever.
> Decentralized systems also make controlling your data all but impossible. If I want to remove my data from Facebook, that's the only entity that I need to deal with and it's relatively easy for regulators to set sound policies in place. In your distributed system, once your data leaves your node, you've lost all control of it forever.
That's already the case though. When I ask Facebook to delete my data they delete their copies of it. However, anyone who has downloaded photos, any advertisers who have scraped that data, any archival attempts, any search results, and a whole slew of other partial copies of that data still exist. It's almost better for data to be explicitly distributed to make it more obvious that it can't be easily removed.
> where a central entity (the business or corp) needs full control over the data.
Or where we want some central control. Until we solve decentralised spam, I really don't want to share databases directly with other users. If you can generate a new identity yourself and then inject your own content into other databases, that's a recipe for pain.
Federation kind of works, but I can't imagine a fully distributed slack for example. I'm on matrix and get the "elon musk wants to share bitcoin" images every day from telegram bridge.
Even in email, spam is still not really solved. I still get both false positives and false negatives.
> This also improves data security in general; there is no central place to hack.
I see your point here, but would also counter with: degraded security because we need N databases to patch every issue after the issue is public, rather than 1 database (potentially) before the issue is public.
> Even in email, spam is still not really solved. I still get both false positives and false negatives.
See my comment about Secure Scuttlebutt below. Email is a little different, in that complete strangers (and apps) should be able to reach you. However, that's not the case with FaceBook, Slack, Instagram etc. If you sync exclusively with people you know (or your organization) and their connections, spam won't exist by design.
What you gotta do is model online communities on the physical world. In meatspace, you choose the people you are accessible to. That'll work here as well.
> I see your point here, but would also counter with: degraded security because we need N databases to patch every issue after the issue is public, rather than 1 database (potentially) before the issue is public.
WASM-based db engine could be a good option, if the app is browser based. We could treat it as part of the app, and the standard security considerations apply.
> The evil CAP theorem enters the town here. Consumers want things to work fast, it doesn't fit the decentralized approach.
I think we have to distinguish 'decentralized' from 'distributed' in this case. In a decentralized approach we reduce (by many orders of magnitude) the size of data we're dealing with. Locally installed/browser-based applications only have to deal with data that's created by, or directed at that user (and locally stored).
Add: For a number of applications, hard consistency guarantees are overkill. You could totally live with eventual consistency (some don't even need that). In fact, other than e-com and banking I can't think of any of my daily-use apps needing such guarantees. And indeed, decentralized data storage is a bad fit for banking and commerce.
Network hops cost time regardless of payload size. Distributed consensus (which is what is required for a “decentralized” system) requires interaction between distinct nodes.
You should note most consensus algorithm designs start with sharing a single ‘register’ (or ‘value’).
Most banks were doing “decentralized” relaxed consensus from the beginning of “banking”. Local office accepts your deposit of your check. “Funds available next day”. Because by “next day” they have confirmation from counter party for the funds.
You are correct, but much of this can be alleviated by aggressive caching on the client devices and delayed (optimistic) updates. It doesn't apply to all problems in this space, though.
FB, LinkedIn and Slack are exceptionally happy that it is centralized; what do you mean “shouldn’t be centralized”? Who cares about what’s better for the user? He’s not the customer anyway.
All of these apps chose technology that was available to them when they started building. So the focus could be to enable building modern usecases on decentralized data platforms.
If one of them becomes the next FaceBook, personal ownership could become a baseline expectation for similar apps. Once data is owned by users, apps become a layer on top of data which can be substituted as needed.
Sure. But there are good examples of decentralization making spam difficult or impossible.
One such is Secure ScuttleButt (SSB). You sync with friends and their friends (and their friends maybe, based on configuration). Now people aren't generally friends with spammers so you don't see their feeds. And if they are, a sufficient number of your friends would be blocking them - once a person is blocked by a certain number of people, stop syncing with them.
I am not convinced that SSB is a solution for all types of apps, but for Social Networks it works darn well. Oh, and zero ads too because you'd never sync with ad networks.
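Roughly, the sync rule could look like this (a toy Python sketch of the threshold idea, all names made up; real SSB replication is more involved):

    BLOCK_THRESHOLD = 3

    def feeds_to_sync(my_friends, follows, blocks):
        """my_friends: ids I follow directly; follows/blocks: {peer_id: set(peer_ids)}."""
        candidates = set(my_friends)
        for friend in my_friends:                    # friends-of-friends get pulled in too
            candidates |= follows.get(friend, set())
        block_counts = {}
        for friend in my_friends:
            for blocked in blocks.get(friend, set()):
                block_counts[blocked] = block_counts.get(blocked, 0) + 1
        # once enough of my friends block someone, I simply stop replicating their feed
        return {p for p in candidates if block_counts.get(p, 0) < BLOCK_THRESHOLD}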
I don't know. Facebook can spot fake profiles because of the centralisation, not sure how that'd be avoided in that system. Also, if users are really anonymous, the system will end up turning into a cesspool of disgusting and illegal stuff. At least that was the fate of all peer to peer social networks I'm aware of.
That would be handled client-side, like email. You could build in a standard set of rules, and then have some kind of ML process profiles you manually mark as spam. Just have it filter profiles in API requests through the fake profile detector.
> At least that was the fate of all peer to peer social networks I'm aware of.
That doesn't seem true of Mastodon. The few servers I've been on were fairly tame. I think the worst I saw was someone posted a sexualized anime-style image, but there wasn't any nudity.
I didn't immediately spot anything weird on Diaspora either, but I only spent like 45 minutes on there.
> Also, if users are really anonymous, the system will end up turning into a cesspool of disgusting and illegal stuff.
Anonymity can be okay, but only in the presence of heavy moderation. HN is basically anonymous if you want it to be, but it stays on target because of the moderation. You can see it on Reddit too. AskHistorians is a wonderful place to be; the subreddits without even basic moderation do indeed turn into cesspools.
Federation has historically made this difficult because you can only moderate your own instance, which is often a minority of the content you consume. There are potential solutions for that, though. You could allow users to "flag" posts, and add those flags to the data you federate. Then, like the fake profile system, you could filter them based on user parameters. I.e. hide posts that have at least 1 "racism" flag.
If necessary to prevent brigades, you could also filter the flags that you accept based on the user that flagged the post. Maybe weight them based on similarities in accounts you follow.
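As a toy sketch of how that client-side filtering might look (Python, with made-up flag names and a hypothetical trusted_weight function; not any existing protocol):

    HIDE_RULES = {"racism": 1.0, "spam": 3.0}    # hide on 1 weighted racism flag, 3 spam flags

    def is_visible(post, trusted_weight):
        """post["flags"] is a list of (flag_type, flagger_id) pairs that travel with the
        federated post; trusted_weight(flagger_id) -> float lets a client discount
        unknown flaggers and up-weight accounts similar to the ones it follows."""
        totals = {}
        for flag_type, flagger_id in post.get("flags", []):
            totals[flag_type] = totals.get(flag_type, 0.0) + trusted_weight(flagger_id)
        return not any(totals.get(ft, 0.0) >= limit for ft, limit in HIDE_RULES.items())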
I have to admit I haven't looked into p2p social networks in years. The ones I tried in the early 2000s (such as freenet) had really dodgy users. Perhaps a lot has been learned since then.
Anyway, I agree with you that what really prevents people from using the system for nastiness is moderation and not necessarily identity. However, the problem is not only fringe and extremist beliefs, but actually becoming a platform for actual crime, through sharing of photos and videos of abuse. That's what worries me the most about p2p social networks and the reason I'm generally not ok with storing other people's data in my computer, even if encrypted.
A lot of facebook's value prop for users is in what you're saying - not as a social media company, but as a centralized identity provider for the social graph.
Perhaps a decentralized solution could still use a centralized identity provider to help with things like spam & impersonation/fake profiles and prove who they are, while still allowing users to own the information _about_ themselves.
The problem here is incentives; there's not much money in being a dumb identity resolver unless you're harvesting everyones data AFAICT, so I sort of doubt a private consumer solution will emerge in the near future.
It'll never happen (in the United States at least), but this sounds like a service that should be run by the local government, or perhaps something more global. People on HN often talk of making the internet an actual public utility; perhaps something like centralized online identity would follow a move like that.
yes absolutely. In my scifi utopia, the government provides a digital identity to each citizen as a social service :)
One of the downsides of this is that it actually degrades privacy by improving private institutions' (and the government's) ability to track you.
One way I've thought of to get around this is to allow people to spawn new identities from their own root identity. This could allow a certain amount of anonymity, while allowing institutions to filter identities by requiring a certain amount of social proof (e.g. length of time since identity has been created).
Some services might ban non-root identities (e.g. banks), while others (like a bar) just need to be able to trust that you have a valid ID that can be used to prove you're over a certain biological age.
In the event that you need to prove you own a specific ID, there could be a way that given a root ID and a non-root ID, you re-generate the non-root ID from the root - thus proving you own it.
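A naive sketch of that derivation (Python; the scheme and names are made up, and a real system would use proper key derivation and signatures rather than bare hashes):

    import hashlib, hmac

    def derive_identity(root_secret: bytes, label: str) -> str:
        # child secret is deterministically derived from the root, per label/context
        child_secret = hmac.new(root_secret, label.encode(), hashlib.sha256).digest()
        return hashlib.sha256(child_secret).hexdigest()   # the public-facing non-root ID

    root = b"my-root-identity-secret"
    bar_id = derive_identity(root, "neighbourhood-bar")

    # proving ownership later: re-derive from the root and the label, then compare
    assert derive_identity(root, "neighbourhood-bar") == bar_id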
I think it’s a naïve position. Mail was decentralized. Now it’s mostly federated among 4-5 big providers.
Historically, people value low cost and convenience over any other feature. Unless there is some GDPR like directive requiring decentralization and self hosting, it will likely stay centralized and become even more so.
There are so many more email providers being used worldwide. Who are the 4 or 5 big providers everyone is using? gmail, yahoo, hotmail, fastmail and protonmail? Don't forget outlook and zoho
Many people still use aol, or the email given by their ISP, or a work email domain.
Worldwide yandex.mail is popular.
Colleges still give email addresses.
People like myself run their own
Vanity domains are popular.
Mail is still decentralized and news of its death has been greatly exaggerated
Google, Fastmail, Microsoft(Outlook/Office365/Hotmail/Live) already cover something like 90% of one side of emails, and 50% of both sides (that is, 50% of emails originate AND terminate among these three, and 90% originate OR terminate among them). I haven't seen any recent stats, but in 2017, 50% of emails already originated or terminated with Google, one provider.
The significance of "originate OR terminate" is that the provider has the ability to meddle or outright ban, either going out or going in.
> work email domain.
> Vanity domains are popular.
> Colleges still give email addresses.
Most universities, workplaces and vanity domains are running within the Google+Fastmail+Microsoft sphere, regardless of the domain name. Every university I've worked with in the last 10 years (e.g. NYU) was using a G-Suite or Microsoft hosted solution, as was every single workplace.
> People like myself run their own
There is a very long tail for the last 1%, but likely 99% of the users are served by 10 companies worldwide. I call that federated, not distributed.
I used to run my own server as well, for ~15 years. Until one day, google stopped delivering my mail to gmail or g-suite addresses (sent it to spam). There was no one to talk to to figure out why. That turned out to be ~70% of the email I sent -- despite having SPF, DKIM, DMARC and everything else properly set up. Eventually, I switched to fastmail hosting because I have better things to do with my time than figure out why Google is blacklisting me.
It's federated, not distributed, and we're all at the mercy of the big few.
Most of the responses to your post are critical, but I couldn't agree with you more. Having a local database would solve all kinds of performance, connectivity and privacy problems. Transparent background synchronization would be beautiful.
Unfortunately, this is a really hard problem to solve. Sync is a bear of a problem in the face of conflicts. An immutable datastore with a log file of all mutations would help a lot, but you'll still have to manage the conflicts. This is not only a back end problem. What kind of user interface do you present that actually makes sense to an untrained user? "Bill changed Mary's name to Marybeth, but Fred changed it to Mary Beth. Which is correct?" Most users will say "Whaaaa? Why are you bothering me?"
> The most popular consumer apps (FB, LinkedIn, blogs, photo sharing, goal trackers, slack, basecamp etc) shouldn't be centralized in this fashion. Data(bases) should be owned by users, and synced with friends and colleagues. So instead of a single centralized database, we need a million tiny little databases which are synced with peers.
> Tiny databases fit well into today's hardware capabilities landscape. We have plenty of fast storage, and powerful CPUs on computers and phones. It's time to stop building as if the server is the only place where queries can run efficiently. Truly serverless, that is. This also improves data security in general; there is no central place to hack.
Sounds like you're describing the MetaCurrency Project's Holographic chain pattern that is currently being developed:
Gitsta is work in progress; nothing to show yet, sorry.
I had looked at holochain earlier, but to be honest I found their literature a bit difficult to grasp (with a quick read). But I was aware though that they're trying to solve the same problem.
My view is that once you are able to reduce the amount of data an app has to deal with, you might even be able to rely on plain text files, yaml collections etc. for data storage/exchange; and fed into a db engine for faster querying.
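Something like this, as a rough sketch (Python with PyYAML and SQLite; the data is made up): the plain files stay the source of truth, and the database is just a throwaway query engine loaded from them.

    import sqlite3
    import yaml   # PyYAML

    doc = "[{name: Mary, team: ops}, {name: Fred, team: dev}, {name: Bill, team: dev}]"
    records = yaml.safe_load(doc)            # in practice: a file you keep in git/Dropbox
    con = sqlite3.connect(":memory:")        # disposable; rebuild it whenever the files change
    con.execute("CREATE TABLE people (name TEXT, team TEXT)")
    con.executemany("INSERT INTO people VALUES (:name, :team)", records)
    print(con.execute("SELECT team, count(*) FROM people GROUP BY team").fetchall())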
> I had looked at holochain earlier, but to be honest I found their literature a bit difficult to grasp (with a quick read). But I was aware though that they're trying to solve the same problem.
I posted your feedback into the Holochain forum, and used it as an entryway into asking some questions I am grappling with:
Do I have your permission to link it to our comments here?
> My view is that once you are able to reduce the amount of data an app has to deal with, you might even be able to rely on plain text files, yaml collections etc. for data storage/exchange; and fed into a db engine for faster querying.
If I understand you correctly then I think this is something Holochain's developers are tackling too. I think for the purpose you describe, they put a big focus on relationships and p2p actions that are 'gossip-ed' (spread) through a content addressable distributed hash table (if me mentioning this strategy seems unrelated, it shows the limitations of my technical competences).
-
If you're interested to engage, I'd love to know what things you found difficult to grasp in your quick read? This is a big subject with a lot of entry points, so my question is a broad question, on purpose. I'm interested in which functional use cases you are seeking/developing, and whether your critique of the current centralized architectures is similar to Arthur and Eric's critiques (the two Holochain originators), versus whether we have reached - and agree on - the same conclusions. The first one (critique) being more important to me: do we see the same challenges/shortcomings? I hope that makes sense.
From my understanding Holochain's main thesis is that centralized software obscures 1) an app's inner logic/workings, and 2) the relationships between peers and the data they create (as well as the shape and size of the data they generate).
Through my own research, the overarching pattern I found is that corporations monopolize blueprints. Through the use of centralized black box software, they can also act as middlemen, extracting parasitic rents. They can control and manipulate us by tracking our behavior and steering us in the direction their algorithms (and often the profit motive, e.g. advertising) have decided for us.
Society's innermost services (like money, law, identity, intellectual property databases, healthcare, land registry) are often invisible or semi-invisible architectures (they require a high clearance level, or the information is stuck in old proprietary formats that make them expensive and difficult to maintain). Because these services are still centralized, they evolve very slowly, meaning that their use remains limited.
A big part of my interest in Holochain comes from the work being done by the Mikorizal.org team (Bob and Lynn, and Pospi) on Open Value Network economic protocols. Using Holochain they are building a Resource Event Agent (REA) accounting model that is completely P2P. Their project aims to provide an alternative to the closed corporate Enterprise Resource Planning (ERP) software that is the norm today. I think this new paradigm can help us see more clearly the actual scarcity of natural resources, and serve as a tool to cooperatively fund and develop open source hardware (and software) in the open.
I mean public insofar as it just happens without you thinking about it; that could be commercial too, it's just more pervasive as a public service, when such infrastructure becomes a _guaranteed_ aspect of a given operating system. But until then, people implicitly _want_ centralised.
One healthy middle ground here is the situation with git. Git is fundamentally decentralized, but I can use it in a centralised way if I want to via services like github, gitlab, bitbucket, etc. Those services allow me to collaborate with other users and back up my projects. And because it's decentralised, I can trivially move / clone my repository to other providers (or to my own servers) without losing any data. (... Well, except for github issues, which arguably should be stored in the repository itself.)
I can imagine a similar situation for a small database too. Live clone to Dropbox or something similar. The user gets all the benefits of a centralised system but stays in control, and they can switch providers whenever they want.
Yeah, but you know in practice that means at least 95% of users are on the centralised version. We continue to build up this 5% technocracy that has different rules applied to it because they know enough to care.
I think it is a common misconception that banks use database transactions the old fashioned way.
They seem to be more like eventual consistency. I initiate a transfer and there is some check about available balance (although with various overdrafts, etc. it is unclear how strict), then it is marked pending and eventually the transaction succeeds or fails days later (an international transaction).
This is conflating things. At a database level it is absolutely ACID compliant, full transactions are used for situations like you mention within one bank.
The fact there may be interplay between custodian banks and various triggered events has nothing to do with the type of database consistency features of an RDBMS.
Yeah, it has been around forever. Ericsson or some Nordic company uses it for phone switches. It is designed for concurrency. Personally I think pretty much everything has already been invented; it is all about people finding it and bringing it into fashion.
I actually feel like Fossil could be adapted to do this fairly easily. It uses Sqlite (it's made by the author of sqlite) so it has the same downsides described in the article, but if you could solve those problems I think it does pretty much exactly what you and the author are describing.
I have long wanted and needed this type of database, and the time is ripe now for someone to implement it. It should be local first and networked/distributed second, since IMHO the latter is a sync problem, not a database problem per se.
The closest to this idea is SQLite (open source and mentioned in the article) and SQL Anywhere from SAP (commercial and not mentioned in the article). But as the article pointed out most of the current solutions rely on SQL.
I'd love the new database system to be open source and use modern open data formats, for example TileDB, Arrow and Parquet. The computation part should be language agnostic and independent of the programming language - something like JuliaDB, though personally I'd prefer the D programming language.
Coincidentally, around seven years ago Bryan from Joyent (Joyent's presentation is mentioned as the main motivation early in the article) blogged about a similar effort targeting POSIX systems (specifically ZFS) that eventually became Manta and is now part of Triton a.k.a. SmartDataCenter [1]. For implementation details please refer to the original paper [2].
That depends on the programming language; for example, C# with LINQ and functional programming languages are pretty good at expressing queries. SQL has no capability for abstraction, so only the simplest queries will be succinct; if you want to e.g. union two similar queries then you'll have to copy and paste.
C# LINQ is basically very roundabout way of doing what SQL was made for. What SQL does, and does really well, is operate on sets of data (aka tables) using set theory. You can get succinct expression with C# and functional languages, but they are not really possible to optimize the same way SQL is because it's not bound by set theory the same way.
SQL by default operates on multisets instead of sets; it has no syntax for a set (aka table) literal, SELECT 1, "a" FROM DUAL UNION SELECT 2, "b" UNION ... doesn't count; sub-queries (using another SELECT in a FROM clause) are a pretty recent addition; and the list goes on. It's not based on set theory, or on first-order logic, or the relational model, not really. If it were, it would be Prolog.
Regarding table literals, some products support using
VALUES (1, 'a'), (2, 'b'), ...
as a table constructor in general, not just in INSERT, and looking at SQL:92, SQL:99, and SQL:2003 it looks to me like this should be standard SQL. Derived tables, aka subqueries, exist in SQL:92 at least as well, so whether or not that is considered recent depends on how you look at it, I think.
I am not trying to defend SQL with this, and all in all this does not take away from the points you raised, but the above were something that stood out.
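For what it's worth, the CTE spelling of a literal table already runs on stock SQLite (a small sketch, with Python's sqlite3 just as the driver):

    import sqlite3

    con = sqlite3.connect(":memory:")
    # a literal table built with VALUES, no INSERT involved
    rows = con.execute("""
        WITH t(id, label) AS (VALUES (1, 'a'), (2, 'b'), (3, 'c'))
        SELECT * FROM t WHERE id > 1
    """).fetchall()
    print(rows)   # [(2, 'b'), (3, 'c')]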
They were in SQL:92, but IIRC adoption was somewhat slow and patchy, and generally finished sometime in the early 2000s.
My point is, the parent claims that SQL "operate[s] on sets of data (aka tables) using set theory" and does it "really well", but when you actually look at SQL, you realise that sets/tables aren't really first-class - derived tables were only added in later versions of SQL, and literal tables still don't exist - even though those are things you'd expect from a language focused on table manipulation. Nope, it's a language for building very specific kinds of queries which was then patched and extended into something more general.
I find most strongly typed functional languages get too hung up on the structure of the rows. E.g. what happens to SELECT user_name, price FROM orders JOIN users USING (user_id)? You might end up with something like this:
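Something along these lines - a rough Python sketch with made-up Order/User row types standing in for the typed-functional version:

    from typing import List, NamedTuple

    class User(NamedTuple):
        user_id: int
        user_name: str

    class Order(NamedTuple):
        order_id: int
        user_id: int
        price: float

    class UserNamePrice(NamedTuple):   # a bespoke row type for this one projection
        user_name: str
        price: float

    def orders_with_user_names(orders: List[Order], users: List[User]) -> List[UserNamePrice]:
        users_by_id = {u.user_id: u for u in users}   # the join order is spelled out by hand
        return [UserNamePrice(users_by_id[o.user_id].user_name, o.price) for o in orders]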
But that will quickly get messy if you have multiple tables, and you have to explicitly say how to order joins (which you don’t do in sql), and something which just analyses the function calls (ie can’t look inside the lambdas) can’t really know that it only needs to look at a few columns of the tables. So maybe instead you need special values to represent the columns of your tables but then how can you represent the necessary renaming when you join one table multiple times? And how will you describe computations with those fields?
Ultimately I think this is a problem largely involving bindings and sets of things and that these are difficult (if not impossible) to move to the type level while providing a nice api. So you could either get a risky api that can throw runtime errors for queries that could never be valid or you get issues like the ones I described above.
Datasets there work like [(column_name, value)] heterogeneous lists would work in Haskell. It really does not get that problematic. (All the problems go away in Haskell if you decide to use maps, too.)
But yes, it throws runtime errors. The nature of retrieving data from a foreign service is that it will throw runtime errors. There is some verification you can do, but a strongly typed database layer just makes promises it cannot fulfill.
With Common Table Expressions you can process the shared data in one set ("temporary table" is a good mental model for it), compute the unique parts in subsequent sets and use the results for the UNION in a single query.
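A tiny sketch of that (SQLite through Python's sqlite3, made-up orders table): the shared part is written once in the CTE, and the unioned branches only differ in their filter.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, total REAL, status TEXT);
    INSERT INTO orders VALUES (1, 'ada', 120, 'paid'), (2, 'bob', 80, 'refunded'), (3, 'ada', 20, 'paid');
    """)
    rows = con.execute("""
        WITH big_orders AS (
            SELECT customer, total, status FROM orders WHERE total >= 50
        )
        SELECT 'paid' AS bucket, customer, total FROM big_orders WHERE status = 'paid'
        UNION ALL
        SELECT 'refunded', customer, total FROM big_orders WHERE status = 'refunded'
    """).fetchall()
    print(rows)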
Linq can be used in a declarative way such that the AST is available and can be rewritten for optimization or transformation say to SQL [1].
Not that this is easy, but many projects utilize it, and the relinq project tries to give you a more usable starting point [2].
After having done SQL for years and then Linq, I actually prefer the more explicit operation of Linq: by default, I know the order of execution is the one I specify. Also, being able to use a full, rich imperative language in the query is so very useful.
I experimented with various approaches to client-side data handling and ended up realizing that I do not want a client-side database at all. I'd much rather hold data in language-native data structures (in my case, Clojure/ClojureScript). These are very powerful and there is zero impedance mismatch: after all, they are a part of the language.
Yes, you do need queries, but there are not many and they might as well be coded as functions. You can do all the indexing you want, if you need to, but JavaScript in browsers is so incredibly fast these days, that even a linear search works well for most cases.
Unless you are handling gigabytes of data client-side, of course. But that wouldn't make much sense.
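A rough sketch of the same idea in Python rather than ClojureScript (shapes made up): plain data structures, "queries" as ordinary functions, and a dict doubling as an index.

    users = [
        {"id": 1, "name": "Mary", "team": "ops"},
        {"id": 2, "name": "Fred", "team": "dev"},
    ]
    users_by_id = {u["id"]: u for u in users}   # build an index only where lookups need one

    def users_in_team(team):
        # a linear scan is fine for the amount of data a single client actually holds
        return [u for u in users if u["team"] == team]

    print(users_by_id[2])
    print(users_in_team("ops"))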
Data modification needs to go through a server API anyway, because the client environment can't be trusted, so two-way sync would be useless. I want the server to perform data model operations while verifying everything, I do not want data sync.
>I'd much rather hold data in language-native data structures (in my case, Clojure/ClojureScript). These are very powerful and there is zero impedance mismatch: after all, they are a part of the language.
>Yes, you do need queries, but there are not many and they might as well be coded as functions. You can do all the indexing you want, if you need to, but JavaScript in browsers is so incredibly fast these days, that even a linear search works well for most cases.
That's just a bad database. It will work for small amounts of data and simple queries. But add in something complex or lots of data and it will fall over.
In theory, if it's all in a single application's memory and using language-native data structures, then you have a lot of power to tailor the data model to meet your application's needs. Barring concerns like object overhead (which I realize is a major consideration in languages like JavaScript and Clojure), you should be able to come up with something that is more efficient than a general-purpose solution like a database without too much trouble. It's really just a special case of the Chuck Moore style, "I coded my own $thing in 1/10 the code and 10x the performance of the popular off-the-shelf option, in a week. NBD, it's easy if you simply choose not to implement any of the features you don't need." thing.
Where DBMSes truly shine is working efficiently with data that you can't or don't want to keep in memory all at once, or for which lots of different clients need concurrent access. The first two of those (but hopefully not the third) are absolutely considerations for certain kinds of browser apps, but maybe not something the parent was worrying about.
I've long thought there's a potentially-useful/missing piece of software infrastructure that's midway between code or data-written-as-code and a full blown database (even one as small as Sqlite).
I have to question the "relational" part. SQL is an absolutely massive language, to the point where "we support SQL" is almost an empty statement because there's no single SQL to support.
While revamping the database, consider a smaller query language. I built a GraphQL-based database query language in my GraphQL compiler project[1], and it's been a huge success at my company so far[2][3]. Here's a demo of cross-database querying with it, if you're curious [4].
I'm building a relational lang (http://tablam.org) so this is close to my interests.
SQL is not just massive, it has grown past its actual capabilities and original purposes without being fixed or cleaned up, and then it ALSO needs to be the interface for some of the most complex pieces of software (RDBMSs) and support the needs of a multitude of features, frameworks and languages. It's like html/css/js.
Going back to basics, the relational model is fairly simple, and you can spice it up with functional and/or procedural features to round it out and still keep it very small.
So yes, I think it can be done MUCH better than SQL and still do MUCH more than it can right now...
P.S.: My language TablaM will (eventually!) be basically this, plus a little array programming. You can fit the relational model very well into a lisp/tcl/datalog-like environment with a small core that is very expressive...
In fact "TablaM" come to sound like "Kablam!" https://www.youtube.com/watch?v=9glnVMq1xrE, a nickelodeon weird program of the past (when it do a lot of cool things). I have a habit of put weird names to my projects...
Then I morphed it into TablaM, and later in the game I created a pseudo-plausible rationalization where "Tabla" (Spanish) == "Table" and the M is for "Multi-paradigm, -platform, -dimension, etc." - that's the explanation I will use in case serious people ask me :)
I'm not really an expert on the field. But, have you considered that maybe SQL is massive for good reasons?
The subset of SQL that covers the majority of use cases is actually small and probably mostly covered by the standard.
All the rest of SQL, which makes it a massive beast, is for all those cases where adding the logic to the database is the most efficient option, for those who need that extra bit of efficiency.
I think SQL is massive because of two major reasons:
1. It has a highly inconsistent, non-generalized syntax
2. It tries to reuse the language for everything -- from schema specification, to querying, to administration, to query engine hints, to procedural language definitions
The RDBMS in total is a highly complex system, but SQL adds a lot of accidental complexity on top of it.
Also, for a new DB you hardly need to support the whole suite of features to produce a useful system -- you just need to constrain the recommended use cases. See SQLite for the quintessential example -- they dropped MVCC and a lot of management/multi-tenancy features as they targeted a surprisingly large niche.
As it turns out, there's a wide variety of data management needs -- SQL is often a beast because most RDBMSs have a goal of targeting all of them.
It's slightly more likely that modern SQL is the byproduct of several query language specifications accelerated to ludicrous speeds and set to collide with each other. Out one end of the apparatus, SQL-86 emerged.
A little known fact about this experiment: the residue splattered along the interior of the apparatus was recovered and named LDAP.
Why is it 'massive'? I don't think it is. Compared to say C++, or programming in python where the language is smallish but the libraries considerable, I can't see why you say that.
I can’t help but be extremely pessimistic. This is an insanely challenging project. Looking at the author’s other side projects: https://euandre.org/about.html
I’d guess this project is about 10,000x-100,000x as challenging as the Songbooks one. And the author has no obvious experience writing DBs, that I can see?
This project seems at least as ambitious as CockroachDB. Development started on CockroachDB 6.5 years ago, was founded by a group of Google alums with relevant distributed systems experience (working on GFS), and it since became well funded and has had a sizeable team working on it full time for years. This project would be almost impossible to make significant progress against without a strong, well funded team working on it for years, but the author speaks about it as a side project.
What does "10,000x-100,000x as challenging" mean? If the songbook cost him one week, this will cost him at least 200 years? I love a good hyperbole, but this is just disparaging.
Most databases that we use are written by hundreds to thousands of people and take years to become reliable and mostly bug-free. It can easily take much more than 200 programmer-years.
By that yardstick nobody should ever attempt to write an OS or programming language or game engine from scratch. I am glad there are plenty of individuals who try.
> you will never have enough time to write something comparable to modern Linux or Unreal alone
That is only true if you're retaining their respective requirements and architecture. Not necessarily a good idea.
Linux supports tons of devices and architectures very few people are using: people are only paid to add stuff, no one is paid to remove it once it's no longer needed, so it tends to accumulate. Writing an OS in C is not necessarily a good idea in 2020; if I were interested in writing a new OS, I would do it in .NET.
Unreal Engine is not that complicated in comparison to modern Linux. I think when people are watching modern video games or technical demos and saying "wow", they overestimate the complexity of the code, and underestimate the complexity of the assets.
Yeah, but be realistic. If you're just one dude with no experience in creating operating systems, then you shouldn't seriously try to write one from scratch.
For giggles, sure, go into the project with the idea that it will fail; that's not an issue. However, if you're serious, then what the poster is saying is that it's over a lifetime's worth of work.
Time to plug the osdev wiki, a very cool read even if you're not planning to write your own OS. Here's the beginner mistakes page: https://wiki.osdev.org/Beginner_Mistakes
The Linux that was created by one person is far from the Linux that exists today. The Linux that exists today was created by hundreds (thousands?) of people over the span of decades.
But they're usually started by one or a few people. If they get somewhere promising more resources can be added over time to make it more than a cool PoC.
Fair enough, pulled those numbers out of my a, and they are indeed too high. ~1,000x seems like a better incredibly rough guess. i.e. if the Songbook project is 1 dev, 1-2 weeks, this would be more like (extremely roughly) 2-4 years for a team of 10 devs.
I don't think so. For one, he wants it to run in the browser, and SQLite doesn't run in the browser. In his "embedded" section, he makes it pretty clear that he doesn't want to build on SQLite:
> SQLite is a great example of that: it is a very powerful relational database that runs almost anywhere. What I miss from it that SQLite doesn’t provide is the ability to run it on the browser
> Mentat was an interesting project, but its reliance on SQLite makes it inherit most of the downsides (and benefits too) of SQLite itself.
Also, I'm sceptical you could make SQLite efficient for an immutable DB, IF you want to support full SQL. A DB like Datomic has to do some pretty crazy stuff to be remotely efficient, and it doesn't even try to support full SQL - a great read on Datomic internals: https://tonsky.me/blog/unofficial-guide-to-datomic-internals
Mentat, the DB he links to, is much less ambitious than this proposal, and had a couple years of work by a number of devs before eventually being shuttered/abandoned. It's about 35K lines of Rust, despite being built on top of SQLite, isn't remotely done, doesn't run in the browser, etc.
Even if you did build this proposed DB on top of another DB, to make it efficient, and have all of the features discussed, I strongly suspect that it would be an absolutely massive amount of work.
CockroachDB sounds like it has a target market of a couple of hundred to low thousands companies though. How many people are in the market for a "[A] SQL database for building global, scalable cloud services that survive disasters." [0]? It sounds like a pretty good product, lots of potential, etc. But there are not that many people who need to run their own cloud and maintain uptime in the event of a data centre being taken out. If I ever use CockroachDB it is probably because some cloud provider decided to offer it and not Postgres. Those core features are targeted at problems that only a very small subset of people have.
Compare that to embedded/immutable/syncable/relational. Everybody in software has dealt with software where that would be a compelling feature set in a database. These are solutions to problems any developer has encountered. It is, in that sense, a much easier project for people to understand and get behind.
OP probably doesn't have useful opinions on how his software should handle a direct nuclear strike on his production machine. He probably does have useful opinions on when he should cheat and let a little bit of mutability into the otherwise 'immutable' system.
I've built out a journal layer on top of SQLite for a flash card app I'm working on. I wanted to be able to sync data across devices but I knew that I'd always just have to handle a single-user/multi-device scenario and never multi-user/multi-device. I was able to come up with a simple CRDT style system where I just log the modifications made per device and share them out with other devices via simple files (i.e. upload to central file store and download peer journals). Each device then has to reconstruct the state of the SQLite database entries using the shared journal files. Since it assumes single user, I can safely use "last write wins" as the conflict resolution strategy.
It's working reasonably well for my small use case, but there are definitely limitations that you have to be aware of. The big one is that, because it's journal-based, the journal can grow and grow. Since I use simple files as the protocol for sharing changes, this means uploading/downloading the same content more than I'd like. If I had a stream-based server that just handled the deltas, it would be better, but my goal with this project was to avoid having to roll my own service. (I wanted users to provide their own cloud storage, say iCloud or Dropbox, and the app would just use it as a central repo, meaning no fees for maintaining the service.)
There are tricks around these limitations that I can work on if I had the time (my main goal is to build the app and not a sync system). For example, I could cap the journal files to a certain size and create a new one when the limit is reached. This avoids having to upload the same contents too frequently. I could also have a strategy where I merge all journals into a central file every so often so that the state of the world is shared there instead of the journals (which can then be discarded).
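For anyone curious, here's a minimal sketch of the scheme as I read it (not the author's actual code): each device appends (timestamp, device, key, value) entries to its own journal, journals are swapped via dumb file storage, and every device replays the union with last-write-wins per key.

    def replay(journals):
        entries = [e for journal in journals for e in journal]
        entries.sort(key=lambda e: (e["ts"], e["device"]))   # device id breaks timestamp ties
        state = {}
        for e in entries:
            state[e["key"]] = e["value"]                     # last write wins
        return state

    phone  = [{"ts": 1, "device": "phone",  "key": "card:42.front", "value": "chat"},
              {"ts": 3, "device": "phone",  "key": "card:42.back",  "value": "le chat"}]
    laptop = [{"ts": 2, "device": "laptop", "key": "card:42.back",  "value": "the cat"}]

    print(replay([phone, laptop]))   # card:42.back resolves to "le chat", the later write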
Interesting! I wasn’t aware of that feature. I actually built out my idea first as its own toy database with journals and later added SQLite as the storage mechanism.
Patchset could work for this, but I’d have to play with it a bit. Off the top of my head, I’d need to redesign things to consider time in terms of sessions as opposed to granular changes. i.e. today I can just append deltas and use a time stamp to know how it fits chronologically with other changes in peers. A new peer created could sort all deltas to play back the database from scratch. I suppose with patchsets it could sort them all by start time and apply them one after the other to recreate the database.
It’s an interesting problem because you can’t assume to have synced recently, so you don’t know the baseline you are starting with, but for the application I’m building, that may not matter.
This sounds like a cool project.
Considering it's a journaling app and I'm assuming the user just enters plain text, have you experimented with compression and decompression before and after the SQLite storage?
It's actually a flash card app, not a journaling one, so the text entered is quite short. I haven't played around with the compression idea, but it sounds like it could be useful for large text entries.
Immutability is an interesting property, but reality is messy. Depending on the use case you envision for the database, data that needs to be scrubbed will accidentally be added to the system. This could be an important private key, or something legally dubious. In an append only/sync system there isn't really a good way to scrub that data. (Replay the entire history, filter bad data, then delete all remotes and resync from clean copy is not good). Again depends entirely on the use case of the database you envision, but if you wish to make it general purpose it is something to consider.
This article[1] from 2015 has stuck with me for years, talking about the same approach. The interesting thing about transactional/bitemporal databases is really the effect on how we consume them with our applications. The schema can be separated from the transaction data, schema migrations become easy, tables become cached materialized views (with the cache updated concurrently at write time), reduced need for application-level cache, no cache cold start, and pub/sub becomes the norm.
I’m the same regarding that article. I always thought of that article as a kind of “deconstruction” or “decoupling” and “rearrangement” of traditional DBMS architecture.
The end result you speak of is a more flexible data platform, which can handle more use cases, including applying decoupled parts to different things.
A good example, in my mind, is Apache Spark - which is like taking a compute and SQL query engine, making it distributed and decoupling it from the underlying storage. Now, suddenly SQL or code can be used on anything you can fit into a data frame, even streams - wow!
Pulling out the transaction log - and wow, suddenly we have durable, potentially immutable queues we can use for even long term storage.
I was so excited by the line of thinking in that article. :)
> Immutability is an interesting property, but reality is messy.
Words have fuzzy boundaries - immutable obviously doesn't mean actually immutable. It means immutable like how git is immutable - you're not meant to change history. It'll fight like a tiger to stop you easily changing history. It'll give you two different ways to get what you want while preserving history. But if you want to change history you can.
SQL is said to be declarative but if you get hit dealing with the wrong sort of workload in SQL then the implementation details can start to matter extremely quickly.
In practice, the author probably only means 'much more immutable' rather than 'illegal to use with European data' level immutability.
True. And what makes it easy is its immutability. When you change history in Git, you rewrite it (rebase). But re-writing is only possible if you know every bit of the journey up to now. When you have actual updates, history is lost.
The way git changes history is a stretch on the word "change": rebasing creates a parallel history and then promotes it to be the true one. The reflog still remembers the previous history.
It’s interesting because it reflects the reality of modern computing — for many use cases storage capacity is not a meaningful constraint.
I ran pretty important systems as a DBA years ago that were significant but would never, ever run into a meaningful constraint on growth. (Performance, different story.)
Sometimes looking at a problem like a database with a new set of constraints leads to interesting and novel things.
I've been working on personal and professional projects where growth isn't a truly restricting constraint but is still an actual constraint. What do you mean by "meaningful"? Once I get into the 10's of terabytes there's no storage system where copying a TiB of data is "instant" or ignorable.
Professionally I work on a financial reporting system where there are around 1.5 million accounts with trading activity. Things like moving a month of data around and checksumming still take on the order of 1 hour. If the size of that monthly dataset doubled, we'd be OKAY, but if it grew 10x we'd have to start restructuring things both on the software side and the hardware side.
For personal work I am working on searching video frame substrings, regardless of codec, resolution etc., within videos in a ~50 TiB video set. Having to re-encode hundreds of GiB of video at a time is something that takes a huge amount of time and not something I can just "throw into an S3 bucket". The growth of said working set is not truly constrained, but it is actually constrained. I can only afford so many 8-10 TiB HDDs at a time! They also fail and have to be mirrored/backed up with rsync at the very least, which adds to the cost for me.
I mean that it's not a technical constraint or limit. Not that it's trivial to do! If there is an ROI in keeping stuff forever, you can.
Think about what you’re saying about your personal project. What you’re talking about wasn’t financially achievable for TV companies just 20 years ago. Hell, anything older than a few days would be stored offline in some tape silo, or in a shelf. (And 10-12 people to attend to it!) Now you as an individual could afford to do all sorts of stuff with that video data, albeit for some significant $.
Let me disagree. I'm working on a transactional OS [1] with an immutable filesystem; it's primarily a website builder. Every request creates a new snapshot of your website. It's append-only for the duration of the transaction. That's the only way to provide robust rollback, and it also provides version snapshots for free - but it doesn't mean snapshots can't later be removed.
My solution was to provide an undo-chain of a fixed given length. Versions older than that simply "fall off the chain" and will be picked up by GC eventually. You can also make snapshots long-lived if you want.
(Of course removing a snapshot is an append-only operation :P but GC will go after non-referenced snapshots eventually.)
The problem of deleting stuff from a history of changes is challenging. But I think Darcs, the distributed version control system written in Haskell, does this rather neatly.
Darcs is different from git in that a "commit" (called a "patch" in Darcs lingo) does not depend upon the entire history of what has come before. Rather it has specified dependencies, which Darcs can infer itself – or you can specify if you want more semantic content in the patch structure.
A repository is a collection of patches, which are merged using something called patch algebra. So, if you want to delete a patch, you just issue the command "darcs obliterate" and it will remove the patch, and anything which depends on it. Then the working directory is reconstructed by merging together any loose ends using patch algebra.
With careful administration of dependencies, this approach makes deleting specific things very slick.
Database immutability does not mean it's impossible to delete content.
It means you separate two use cases: retracting information (I remove an item from my shopping cart because I want another item) and purging data (I delete my account and all my shopping history because I don't want the shop to store my data anymore).
You can still purge data easily in Datomic, for example.
I really don’t get why GDPR gets dragged up every time someone mentions immutable data stores; was everyone really building append-only systems before 2018?
It has been a legal requirement in the UK that databases must be able to delete personal data since the data protection act 1984 - https://www.legislation.gov.uk/ukpga/1984/35/enacted - in particular see the data protection principles in schedule 1: the 6th principle requires that data is retained no longer than necessary. There are provisions elsewhere in the act that can require a data holder to delete data.
Most of GDPR is not really new. There was a significant amount of teaching about data protection when I was at university in 1994-1997 and it wasn’t even new then.
About a couple years ago I used sql.js [1] in an internal tool to provide the user the ability to execute SQL queries against the data they had loaded. It worked just fine.
> test_onefile.c - This file implements a demonstration VFS named "fs" that shows how SQLite can be used on an embedded device that lacks a filesystem.
Especially an immutable database (like Git's object store) sounds like a very powerful concept. No more need for modification dates (they would be in the change record) and no need at all for separate log tables and all that guff. Every CRUD system ends up with some sort of audit log tacked on, costing heaps in maintenance, storage and extra processing for every change. In addition I would want there to be a command analogous to `vacuum` to truncate this history while the database is online, similar to Git's rebase or filter-branch.
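A rough sketch of what that buys you (SQLite through Python, made-up person table): the history table is the only table, the "current" view is derived from it, and the vacuum-like truncation is just a delete of superseded versions.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE person_history (
        id      INTEGER,
        name    TEXT,
        version INTEGER,
        changed TEXT DEFAULT (datetime('now'))   -- the "modification date" lives in the history itself
    );
    CREATE VIEW person AS
        SELECT id, name, changed FROM person_history p
        WHERE version = (SELECT max(version) FROM person_history WHERE id = p.id);
    """)
    con.execute("INSERT INTO person_history (id, name, version) VALUES (1, 'Mary', 1)")
    con.execute("INSERT INTO person_history (id, name, version) VALUES (1, 'Marybeth', 2)")
    print(con.execute("SELECT * FROM person").fetchall())

    # the vacuum-like truncation: drop every superseded version, keep only the newest
    con.execute("""
        DELETE FROM person_history
        WHERE version < (SELECT max(version) FROM person_history h
                         WHERE h.id = person_history.id)
    """)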
I already got one? My employer mostly has these requirements. We ended up writing our own query language because it is a graph DB, but all the bits are there.
LMDB + a journal + a query interpreter is really all you need. We don't have off-line mirroring yet, but that's on the list.
You could start with Synapse [] and rip out the bits you don't need. It is open source.
Yeah, I agree. At AppShare we take our server side databases and let them run in the browser. We even copy the data, so you can take a server side Sqlite database and save the Sqlite engine and all the data as an HTML file.
Having it on the browser level would have the huge advantage of the DB contents being visible to the users in devtools (not just to the developers via their custom tooling).
I don't think browsers have a good API it could store to, though. As far as I know there isn't really anything for storing efficiently in-place updateable/readable large binary blobs.
A basic implementation would use one document per page. An advanced integration might replace the whole sqlite storage engine with one based on IndexedDB.
I'd like to make it go away, but as a simple k-> v store it's decent enough. Web SQL provided relational data and aggregation, something that's sorely lacking from IndexedDB (and the libraries that build on top of it, poorly replicating SQL in the browser).
The amount of client side code one has to write to replicate simple SQL group by/sum operations is mind numbingly bad -- not to mention performance hit vs. SQLite's blazing native C performance (and it shipped with the browser to boot, no separate download of some lousy bloated 3rd party JS library).
I think DuckDB checks a number of these boxes! It is embedded and written in C++, so compilable to WASM. It is also roughly 10x faster than SQLite on analytical queries and interoperable with Apache Arrow! It might be a good place to start, anyway!
I think it's crazy that database updates are still acceptable in the design of systems and databases. Updates are deletes in practice, you lose what was there before.
Each record is stored in 2 * [# of nodes] * 3 + [backups] places. In our case that's 15 places before we talk actual backups (online, offline, and cold storage).
I'm actually a huge fan of immutable records conceptually, but I am unable to make a business case for it outside the most vital tables, that need amazing audit histories. The costs are just too high, and a lot of data not valuable enough.
I'll acknowledge that if engineering time was lower it would be an easier sell, but in large databases you're still talking about at least a quadrupling of space through every layer: What's the dollars and cents argument for that?
Most software isn’t FAANG. In the long tail of database size, I’d guess 95%+ of databases are smaller than a few hundred megs. And at that size, storage is cheap and storing a full historical record of changes should absolutely be the default.
Rich Hickey has a great talk on this a few years ago talking about the difference between Place and Value. He says accountants learned this lesson hundreds of years ago with bookkeeping. Users just don’t generate much data relative to the size of our disks. Well worth a watch, especially if you disagree:
> Most software isn’t FAANG. In the long tail of database size, I’d guess 95%+ of databases are smaller than a few hundred megs.
I think that statement is plainly untrue for most databases that generate enough value to pay for a developer working on them. I would bet it can be reversed: at least 95% of business-relevant databases are larger than a few hundred MB.
I have managed a small 200-player old-school browser game for a few years, and our database generates a few hundred megabytes per month. A major source of this data is a user trace-log ('what action they performed when', compacted as a timestamp and 2 integers), which we clear on every new round (about 2 months). Keeping every update (instead of the small trace log) would easily scale to gigabytes per month. And we're talking 200 users. I also worked for a small 150-employee web service company, where the development database snapshot was about 5 GB (a heavily trimmed down and anonymised version of the production db).
Now, cold storage is cheap. But fast-access database storage isn't the same thing as a cheap disk. That's why we keep hot data in memory, only write to disk when necessary, and back up to cheaper storage.
Lastly, a relational database's WAL is a complete historical record of all writes. It's enabled by default and in fact necessary for ACID and consistency guarantees. Keeping this log gives you a full history that allows you to restore the database to any given point in time without polluting the actual data with historical records. Granted, access to this data is much harder than a simple query.
There are also many other options to keep historical data rather than keeping it in the OLTP database, like OLAP data lakes.
So immutable records only make business sense in databases too small to benefit from immutable records (where you can just version the entire database, for a "few hundred megs")?
I don't understand what FAANG has to do with this. Medium or large non-tech companies commonly have business critical databases in the hundreds of gig range.
Acting like medium to large databases aren't common in business is just outright odd. The argument also doesn't address what I asked (business case for this vis-a-vis cost).
> So immutable records only makes business sense in databases too small to benefit from immutable records
My claim is that discarding historical data is an optimization. It's an optimization that should be off by default and turned on when needed. Archiving and compacting history should be something you only need to do when your database size gets out of control.
For small databases there's no reason to throw away historical records at all - and having an immutable log of records and updates should be the default.
Sure. Immutable doesn't mean you can't delete things:
> True. And what makes it easy is its immutability. When you change history in Git, you rewrite it (rebase). But rewriting is only possible if you know every bit of the journey up to now. When you have actual updates, history is lost.
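Roughly, as a sketch (the naming is mine, not any particular database's): a delete in an append-only world is just another appended record, and actually erasing history is a separate, explicit rewrite of the log, much like a rebase:

```typescript
// Append-only log in which a "delete" is itself an appended tombstone.
type Entry =
  | { kind: "put"; key: string; value: string; at: number }
  | { kind: "delete"; key: string; at: number };

const log: Entry[] = [];

function put(key: string, value: string): void {
  log.push({ kind: "put", key, value, at: Date.now() });
}

function del(key: string): void {
  log.push({ kind: "delete", key, at: Date.now() }); // history stays intact
}

// Current state is a fold over history; nothing is updated in place.
function currentState(): Map<string, string> {
  const state = new Map<string, string>();
  for (const e of log) {
    if (e.kind === "put") state.set(e.key, e.value);
    else state.delete(e.key);
  }
  return state;
}

// Truly erasing data (say, for a legal request) is an explicit rewrite
// of the log -- the moral equivalent of a rebase.
function compact(): Entry[] {
  return [...currentState()].map(
    ([key, value]): Entry => ({ kind: "put", key, value, at: Date.now() })
  );
}
```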
I've also wanted this. Operating on data locally means it can be fast and low-latency (on device, not between devices).
I think the secret lies in using CRDTs to sync data in a way that allows it to be processed into another system. E.g. Postgres on a server and SQLite on a phone.
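As a toy illustration of the sync primitive I mean (my own sketch, not a reference to any specific library): a last-writer-wins register is about the simplest CRDT there is, and because its merge is symmetric, the same logic can run against Postgres rows on the server and SQLite rows on the phone:

```typescript
// Last-writer-wins register, the simplest useful CRDT.
// Each replica (server or phone) keeps (value, timestamp, replicaId)
// per field and merges by taking the newest write, with replicaId as
// a deterministic tie-breaker.

interface LwwRegister<T> {
  value: T;
  timestamp: number;   // a hybrid logical clock in a real system
  replicaId: string;   // e.g. "server" or a device UUID
}

function merge<T>(a: LwwRegister<T>, b: LwwRegister<T>): LwwRegister<T> {
  if (a.timestamp !== b.timestamp) {
    return a.timestamp > b.timestamp ? a : b;
  }
  // Same timestamp: break the tie deterministically so every replica
  // converges on the same value.
  return a.replicaId > b.replicaId ? a : b;
}
```

Because merge() is commutative, associative and idempotent, replicas can exchange rows in any order, over any transport, and still converge.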
Noms has gone pretty quiet since the acquisition. Dolt is based on Noms (not sure how much it has diverged) and much more active: https://github.com/liquidata-inc/dolt
Upvote for Syncable & Embeddable. It will take a massive amount of time to build from the ground up, so I wonder if it would be sensible for you to build SQL on top of RocksDB like CockroachDB did: https://www.cockroachlabs.com/blog/cockroachdb-on-rocksd/
Aren't relational AND syncable very difficult to combine? Foreign key references, for example, may not yet exist on one instance when changes are synced. PouchDB/CouchDB and Git support multi-master sync, which is great, but they are not relational.
The workaround is to detect foreign key violations in your app with indexes or a job that runs on the nodes.
Log shipping works well for single master scenarios and these solutions can be combined with a partitioned primary key space for inserting new rows in the remote databases.
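One common way to do the "partitioned primary key space" part (a generic sketch, not tied to any particular product): make every key carry the node that minted it, so concurrent inserts on different replicas can never collide:

```typescript
// Partitioned key space: each node mints keys only from its own range,
// so rows inserted concurrently on different nodes never collide and
// can be shipped or merged without renumbering.

class KeyAllocator {
  private counter = 0;
  constructor(private readonly nodeId: number) {}

  // Composite key: high bits identify the node, low bits are a local counter.
  next(): bigint {
    this.counter += 1;
    return (BigInt(this.nodeId) << 32n) | BigInt(this.counter);
  }
}

const phone = new KeyAllocator(1);
const laptop = new KeyAllocator(2);

console.log(phone.next());  // 4294967297n (node 1, sequence 1)
console.log(laptop.next()); // 8589934593n (node 2, sequence 1)
```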
A two-way embedded/server relational sync is a much more difficult problem but it has been solved before. SQL Anywhere is a commercial relational database that has exceptional synchronization capabilities for fleets of mobile devices. Its MobiLink gateway synchronizes with heterogeneous DBMS servers like Oracle.
A new relational sync project should become familiar with MobiLink, in my opinion.
I think something even better than relational may be using OOP objects directly, via the system prevalence pattern.
If you're willing to model your "DB updates" using event sourcing, you could fork something like Prevayler to achieve your immutability needs while being able to serve as source of truth across multiple nodes.
It's very difficult to have traditional OOP with pervasive immutability, because it almost entirely removes the idea of a stable instance of an object. There is no single instance that you call methods on or send messages to. Every method call or message processing must produce a new instance.
More generally, encapsulation fails here. Immutability cannot be encapsulated behind the same interface as mutability. An immutable interface by necessity will look different from a mutable interface (at the very least methods that would normally return `void` in a mutable interface cannot ever exist in an immutable interface).
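To make the interface point concrete, here is a hypothetical counter, nothing database-specific:

```typescript
// A mutable interface: the "write" returns nothing, and the instance
// is the stable thing you keep sending messages to.
interface MutableCounter {
  increment(): void;
  value(): number;
}

// The immutable version of the same capability cannot hide behind that
// interface: every write has to hand back a new instance.
interface ImmutableCounter {
  increment(): ImmutableCounter;
  value(): number;
}

class Counter implements ImmutableCounter {
  constructor(private readonly n: number = 0) {}
  increment(): ImmutableCounter {
    return new Counter(this.n + 1);
  }
  value(): number {
    return this.n;
  }
}

// Callers must thread the new value through themselves; there is no
// single stable "place" to mutate.
const c = new Counter().increment().increment();
console.log(c.value()); // 2
```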
Scala is mostly immutable, and OOP+FP works very well in it.
Also, just because an object is immutable doesn't mean that every method is going to return a new instance of that object. For example, a 'sum' method on a list-of-integers object would return an integer, not a list.
As a general rule of thumb Scala is precisely OOP when it's mutable and FP when it's immutable. The fusion occurs exactly at the boundary between mutable and immutable code.
Yes if you have only read-only methods then immutable and mutable methods coincide, but that's precisely the place where mutability doesn't matter to begin with.
Eventually somewhere in your code you have to write something and that's where mutability both matters and results in different APIs.
However, you're right that my initial statement, "Every method call or message processing must produce a new instance", is way too strong. I should have said every write-based call (or message processing).
Good points. There's also the minority notion that you can use OOP where it makes sense design-wise, and procedural/functional/whatever where that makes sense. Of course, if code becomes too esoteric, you risk making a special snowflake no one else is willing to look at. That probably explains why people take an all-or-nothing approach in most cases.
My point then is to fully utilize OOP where applicable, to gain the advantages of orthogonal design. This extra design effort may take many attempts to get right, but it provides more flexibility and power than initial iterations do.
Gotta figure out what your needs are. For example, if browser and Node is a requirement, then you can look at RxDB, which uses Pouch and either IndexedDB or various file options.
If you want everything, it’s going to be big and not appropriate for web or mobile unless part of browsers or the OS.
Hi, I'm Mikeal, I created PouchDB. After a year or so I handed the project off to some great maintainers who made it much better, as I had grown a little skeptical of the replication model and wanted to pursue some alternatives.
It’s been about 10 years, much longer than I thought it would take, but I have a young project that finally realizes the ideas I had back then.
Sometime after PouchDB was created I realized that it just wasn't going to work to replicate a whole database to every client. In fact, even a view of the database wasn't going to work, because the developer isn't in a position to really understand the replication profile of every user on all of their devices; you need a model that has partial, or more accurately "selective", replication based on what the application accesses in real time.
I became convinced that the right primitives were already present in git: Merkle trees. Unfortunately, git did a very poor job of surfacing those primitives for general use, and I wasn't having much luck finding the right approach myself.
Shortly after joining Protocol Labs I realized they had already figured this out in a project called IPLD. Not long after that, I started leading the IPLD project/team and then putting together my ideal database whenever I found a free moment.
It's very young, with lots of missing features, and I'm still working on better data structures for indexing, but it is very much a database that replicates the way git does and approaches indexing over a primary store the way CouchDB does. And there's a lot more too.
With these primitives we can easily nest databases inside of other databases (and create unified indexes over them) and we can easily extend the data types in the database to user provided types. Using some of these features it already supports streams of binary data, databases in databases, and linking between pieces of data.
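For anyone who hasn't dug into the git/IPLD primitives, the core trick is small. A much-simplified sketch (real IPLD uses CIDs, codecs and multihashes; the helpers here are mine): store every value under the hash of its encoded bytes and make "links" just hashes, and nesting plus deduplicated replication fall out almost for free:

```typescript
// Simplified content-addressed block store in the spirit of git/IPLD.
// Illustrative only; real IPLD uses CIDs, codecs and multihashes.
import { createHash } from "crypto";

const blocks = new Map<string, string>(); // hash -> encoded block

function putBlock(value: unknown): string {
  const encoded = JSON.stringify(value);
  const hash = createHash("sha256").update(encoded).digest("hex");
  blocks.set(hash, encoded); // idempotent: same content, same address
  return hash;
}

function getBlock(hash: string): unknown {
  const encoded = blocks.get(hash);
  if (encoded === undefined) throw new Error(`missing block ${hash}`);
  return JSON.parse(encoded);
}

// "Linking" is just storing another block's hash, so a database can nest
// inside another database, and replication reduces to: send whichever
// blocks the other side's root hash cannot reach yet.
const photo = putBlock({ type: "photo", bytes: "..." });
const album = putBlock({ type: "album", items: [photo] });
const root  = putBlock({ type: "db", collections: { albums: [album] } });
```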
Let me know what you think, there’s still plenty of work to do, but it has 100% code coverage, so it does what it says it does :) https://github.com/mikeal/dagdb
Indeed, Irmin is a library for implementing data stores that can be freely cloned, updated and merged, following a model similar to git.
It separates the content from the actual store, so it can run just as well on a file system, in a git repository, in memory, or in the browser (see for instance https://github.com/talex5/irmin-js).
Redwood is a decentralized database providing a data model that's much like a Redux state tree. One of the fundamental design goals was to better harmonize how data is handled on frontend and backend layers.
It's also:
- syncable/offline-capable, thanks to a CRDT system
- able to serve as a standalone Git backend
- embeddable into Go applications
- immutable, in a blockchain-y sense, due to its append-only ledger
There are some demos in the Github repo:
- A real-time collaborative text editor
- A chat application
- Git integration
- A video streaming application (still improving performance here, but it's functional)
It's pretty alpha still, but feedback and contributions from anyone interested would be very welcome.
It's a super interesting set of requirements, and I've been mentally trying to wrestle with an extra one on top of that: selectively syncing data.
Let's say I've got three users of my bookkeeping app. Person A and Person B are in different companies, totally separate data. Person C is a bookkeeper and wants to work with both their sets of data.
It's a trivial example where you can just say "well you just keep Person A's and Person B's data separate and switch what the bookkeeper gets to see when they want to reconcile separate accounts".
I don't have any good solutions for it yet, but I'm enjoying exploring them.
I just want a database that takes no configuration options except for a list of peers and a disk partition to write to and provides strictly serializable transactions 100% of the time (modulo partitions).
I'd rather start with slow and correct, and implement hacks to improve performance once slow becomes "too slow". But instead, database engines are sold on benchmark performance, so out of the box they lose transactions they told you they committed, or read uncommitted writes, or all sorts of other crazy things. I don't want this unless I write code myself to do those things. (If I go through the API, it must be completely consistent. If I clone the drive of the database server and pick around at random disk blocks, sure, maybe I'll see some uncommitted writes.)
I really like FoundationDB. It still feels like that system where if it breaks you will be blamed aggressively for picking a weird thing, so I'd like to see some more success stories, I guess. But if I really care, maybe I should be the success story.
The Go API is also kind of weird, and conflicts with my general thought on Go code that every I/O operation should be cancelable with a context -- they really don't subscribe to that point of view.
Those are the main reasons why I don't use it... but I guess they have the fundamentals down (the "foundation" if you will), so it's probably worth powering through those nits and start using it.
I'm not sure what you mean by not as flexible as the relational model. I'll address SSB, as I'm more familiar with it, having built a CRDT system on top of it in ClojureScript.
Each message can be linked by its content-addressable hash. The engine allows Map/Filter/Reduce queries, which are not much different in capability from SQL queries; these can even be indexed to speed up views. The only real difference is that no schema is enforced on each message type, but it's not that difficult to enforce something like that at a higher layer when publishing entries.
If you think about it, storing JSON entries is a lot more flexible than table rows. Also, the content stored need not be JSON; it can be EDN, transit, or a reference to binary blobs.
> Each message can be linked by its content addressable hash.
What I had in mind is that, while (as you well said) you can link the data in a pub in any way you need, the distribution model is fixed. The only way to change that is to create more pubs.
I think the distribution model is an integral part of the data structure in a general p2p setting and needs to be customizable, within reason.
Thanks for your response, I find this stuff super interesting to watch evolve.
I think I understand a bit of your concern, although note that pubs aren't strictly required for replication in SSB; one can use onion routing, physical devices (sneakernets), local-area connections, etc. But I agree that there are more areas to improve, such as changing the network/capability keys on the fly from within a message to ratchet the distribution, or making references to content hashes within different network keys. I'm curious whether your work takes any of this into account? Thanks again.
Maybe one can learn from and avoid the biggest mistake of SQL: mixing commands and data. Maybe the command should be kept separate from the data it operates on.
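In other words, the injection problem: the command channel and the data channel are the same string. A rough sketch of the separation (a generic placeholder shape, not any specific driver's API):

```typescript
// Mixed commands and data: the value is spliced into the command, so
// data can change the command's meaning (SQL injection).
const name = "O'Brien'; DROP TABLE users; --";
const mixed = `SELECT * FROM users WHERE name = '${name}'`;

// Separated commands and data: the command is fixed and contains only
// placeholders; values travel alongside it, out of band.
// (Shape only -- real drivers expose their own parameter-binding APIs.)
interface Statement {
  command: string;       // never contains user-supplied values
  parameters: unknown[]; // values are sent separately
}

const separated: Statement = {
  command: "SELECT * FROM users WHERE name = ?",
  parameters: [name],
};

console.log(mixed);     // the command's meaning has been hijacked
console.log(separated); // the command's meaning cannot change
```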
Take a look at Syncbase (https://vanadium.github.io/syncbase/), it has some of these characteristics including secure p2p sync and a SQL-like query language. It's not under active development but might be a decent starting point to pick up from.
I totally agree and love this list, but I also wish for graph queries like OpenCypher. The graph community really needs something like that, because right now it's all either centralized, expensive and proprietary, or tied to the JVM. If the SQL standard evolved to include a MATCH statement and ASCII-art syntax for traversals, it would make JOINs so much easier.
Have you ever used a flat-file database? I envision the perfect database having a file-system API that automatically sorts records and can be edited as standard files, but stores data in something like JSON or YAML. That way it can be versioned and queried using standard file-system tools, or anything that follows the Unix Philosophy.
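Roughly the shape I have in mind, as a bare-bones sketch (one record per file; the directory layout and helper names are made up):

```typescript
// One-record-per-file store: each record is a small JSON document, so it
// diffs cleanly under version control and can be grepped, sorted and
// piped with ordinary Unix tools.
import { mkdirSync, readFileSync, readdirSync, writeFileSync } from "fs";
import { join } from "path";

const DIR = "records"; // hypothetical data directory
mkdirSync(DIR, { recursive: true });

function save(id: string, record: object): void {
  // Pretty-printed so line-oriented diffs stay readable.
  writeFileSync(join(DIR, `${id}.json`), JSON.stringify(record, null, 2));
}

function load(id: string): object {
  return JSON.parse(readFileSync(join(DIR, `${id}.json`), "utf8"));
}

function all(): object[] {
  return readdirSync(DIR)
    .filter((f) => f.endsWith(".json"))
    .map((f) => JSON.parse(readFileSync(join(DIR, f), "utf8")));
}
```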
I tried this for a small project once about 10 years ago, using svn as version control and putting a GUI around it. It was clunky, with no good way of doing uniqueness or foreign key constraints, and slowed way down after ~1000 records. I ended up moving to a regular database and implemented the versioning logic manually, which was pretty straightforward.
At first this irked me because the final solution didn't retain the full power of a VCS: no merges, no branches, etc. But what I finally came to terms with was that it didn't need to (YAGNI): it just needed historical records, and that was pretty easy. It ended up being much cleaner and faster that way.
I think that you'd also have to implement indexes to make that work, just like SQL databases do. Even with SQL, I ran into plenty of cases where indexes were necessary for decent speed. Flat files would just amplify that.
And, if you're trying to use the VCS to do merges, the indexes will constantly get messed up, so they'd have to be reindexed after every merge to be sure they were legit, and it still wouldn't ensure uniqueness... Ouch.
I have yet to see any GB/TB project that cannot be handled by a traditional RDBMS, so I don't wish for new databases, other than when I have to work without either Oracle or SQL Server and miss their productivity tooling.
With those priorities in mind I'd probably start with RocksDB and build a SQL engine on top. I'm really impressed with RocksDB's performance, space requirements and feature set.
TileDB Embedded (https://github.com/TileDB-Inc/TileDB) checks all four of these boxes. It is a universal storage engine based on dense and sparse multi-dimensional arrays. I explain each point below.
> embedded
TileDB Embedded [1] is an open source (MIT licensed) embeddable C++ library which exposes C and C++ APIs. We also built APIs for Python, R, C++, Java and Go, and integrations into MariaDB, PrestoDB, Spark, GDAL, PDAL and more. Via MariaDB we even have embeddable SQL[2]. TileDB abstracts the storage backends behind an extendible VFS class, and currently supports S3, GCS, Azure, HDFS, and local disk.
> immutable
TileDB is designed around immutable objects. Every write creates a new "fragment" using an MVCC model[3]. No file is ever updated in place. TileDB is designed to naturally handle the eventual consistency and constraints of cloud object stores. This allows for features like multi-reader/multi-writer support, time traveling, and update support, all within the storage engine and without any external orchestration.
> syncable
TileDB's MVCC approach to handling the eventual consistency of cloud object stores translates directly into its sync-ability. Every write operation in TileDB creates a new "fragment"[3]. A fragment is an immutable folder (or prefix, on object stores) that contains all the data from a single write session. Fragments are created with a timestamp and a UUID to ensure uniqueness. A fragment is ignored until a special `.ok` file shows up in a listing, which makes each write atomic. Incomplete fragments (without the `.ok` file) are gracefully ignored.
This means that every fragment is self-contained and syncable. The only requirement is that the special `.ok` file must show up only after the complete fragment folder has been synced. With cloud object stores this is not a problem, given their read-after-write consistency guarantees. For other systems, the synchronization can be managed to preserve this ordering.
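To sketch the commit sequence just described (illustrative only, not TileDB's actual API; the `ObjectStore` interface here is made up):

```typescript
// Each write session lands in its own timestamped, UUID-named fragment,
// and readers skip any fragment whose ".ok" marker hasn't appeared yet.
import { randomUUID } from "crypto";

interface ObjectStore {
  putObject(key: string, body: Uint8Array): Promise<void>;
  listKeys(prefix: string): Promise<string[]>;
}

async function writeFragment(
  store: ObjectStore,
  files: Map<string, Uint8Array>
): Promise<string> {
  const fragment = `fragments/${Date.now()}_${randomUUID()}`;
  // 1. Upload all of the fragment's contents first.
  for (const [name, body] of files) {
    await store.putObject(`${fragment}/${name}`, body);
  }
  // 2. Only then publish the ".ok" marker; until it exists, readers treat
  //    the fragment as incomplete and ignore it.
  await store.putObject(`${fragment}/.ok`, new Uint8Array());
  return fragment;
}

async function visibleFragments(store: ObjectStore): Promise<string[]> {
  const keys = await store.listKeys("fragments/");
  return keys
    .filter((k) => k.endsWith("/.ok"))
    .map((k) => k.slice(0, -"/.ok".length));
}
```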
> relational database
Tables can be easily modeled as multi-dimensional sparse arrays[4]. Through TileDB's integrations with MariaDB and PrestoDB (and Spark), you gain a full SQL interface, while being able to query the data directly via the language APIs without using SQL. For more robust features, such as foreign key enforcement, it is easy to use embeddable MariaDB to achieve this, while we work to push such features into the storage engine itself. We are always seeking feedback on which features are most used and requested to pushdown common operations to the storage engine.
Firebase (whether it be Firestore or realtime database) has almost none of the desired properties in the article. The only one it does have, as far as I can tell, is "syncable".
I'm not sure what exactly Scuttlebutt covers or does not cover, but an example real problematic situation that I have seen is like this:
1. The system among other things stores user-entered content;
2. User A has published sensitive personal data about person B (perhaps in a doxxing-like post, perhaps by making a fake profile with their name and data, perhaps including interesting images). To also bring other laws in other countries into play, let's assume that person B is underage (which is realistic and common; school kids do things like that to one another).
3. Person B requests that user A removes that content, and A refuses.
4. Person B escalates (the specific process does not matter much) finally resulting in you getting a legally binding request to remove that information about person B from your system/product.
So the key tech requirements include both breaking immutability and full centralised control; you can do decentralised systems for e.g. performance and bandwidth reasons, but fully decentralised, user-controlled censorship-resistant technologies can't be used because then you can't implement censorship, and you need that ability.
There doesn't even have to be a person B. Under the GDPR you have the right to request erasure, which means the party holding your data must erase it, whatever it may be, and it must be erased from backup systems too.
I believe decentralization is a cool concept from an engineering perspective, but it creates more problems than it solves. The current model of big, centralized services is kinda like democracy. Bad in a lot of ways, but everything else has even bigger problems, so that's what we're stuck with.
First, it's very hard for end users to actually connect and discover each other. NATs are everywhere, so direct connections are often not possible, and introducing a central server defeats the point. Most non-techies spend the majority of their time on mobile, so keeping connections alive in the background is also not an option. Everyone running their own server is not realistic either. It has been proven time and time again that, if people can choose between paying and getting something for free, they will choose the latter. Whether the ads are personalized or not (what some people call tracking) doesn't matter that much.
Monetization is hard with decentralized services, and money is very often required to make something with good enough UX for most people. If you don't collect a lot of data, you make ad fraud easier, so ads are not an option. People don't like to pay (see above), and payments are hard to enforce if you're not running the service, so charging is not an option either.
Making decentralized services is much harder than making centralized ones. It's easier to pretend everything in the world happens in order and accept events one by one than to deal with the messy reality of thousands of devices doing thousands of things at once, without a coordinating entity. Imagine one user kicking everyone out of a group while, in other parts of the network, he himself is being kicked out. What do you do when those parts reconnect? Who decides? Timestamps can't be trusted either, as they can be faked. Now you have consensus problems, you're dealing with Byzantine fault tolerance, and you're inventing a blockchain. A central, trusted server is easier to write and less prone to security exploits.
Storing everything on your own device isn't great either. If you have 100 friends on Facebook, and each of them has 1000 photos, you don't have enough space to store them all. Not storing them isn't an option either, as you might actually want some of them at some point. Asking the original poster doesn't work, as they might not have the app open at that moment and so are disconnected from the network. With something like a decentralized Amazon, you need search, and good, privacy-friendly, decentralized search is somewhere between hard and impossible.
If you can't have captchas, rate limits and a lot of data about your users, it's hard to prevent spam, spreading fake news, attacks, catfishing and similar behavior. Restricting your network to friends works, as long as you accept the fact that no stranger (i.e. at your new place of work) can ever find or contact you.
Even if you manage to overcome all those problems, there's still a pretty high risk that someone might do an embrace, extend, extinguish on your network with their proprietary app, kind of like Gmail is doing with email.
I want decentralization to work and Facebook to be gone just like everyone else, but, as sad as that is, it never will.
The solution should be a generic CRDT implementation, built around a change log with unique identifiers for each participant and operation. But we also need filesystem/operating-system support for merging; otherwise you will need a unique file for every operation to cope with merging, or you end up with app-specific merging and no data interoperability between apps.
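A sketch of what "unique identifiers for each participant and operation" could look like (my naming, not any particular CRDT library's): give every operation a (participant, counter) ID, so logs from different devices can be deduplicated and merged into one deterministic order no matter how they arrive:

```typescript
// Operation IDs of the form (participant, counter) make every change
// globally unique, so logs from different devices can be merged,
// deduplicated, and replayed in a single agreed-upon order.

interface Op {
  participant: string; // stable ID of the device or user making the change
  counter: number;     // per-participant sequence number
  payload: string;     // the change itself (application-specific)
}

function opKey(op: Op): string {
  return `${op.participant}:${op.counter}`;
}

function mergeLogs(a: Op[], b: Op[]): Op[] {
  const byKey = new Map<string, Op>();
  for (const op of [...a, ...b]) byKey.set(opKey(op), op); // dedupe
  // Deterministic total order: counter first, participant as tie-break.
  return [...byKey.values()].sort(
    (x, y) => x.counter - y.counter || x.participant.localeCompare(y.participant)
  );
}
```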
My "small" wish would be to have a local but replicated data format for a text file. So I can edit it in any editor while still retain the history and allow for collaboration from multiple devices. I'd love for it to be shared between devices using dropbox and iCloud depending of the user.
> An embedded, immutable, syncable relational database.
Sounds like he wants a blockchain?
A blockchain probably isn't really what OP wants. But funnily enough it does meet all those criteria.
> what I’m after is a database to be embedded on client-side applications itself, be it desktop, browser, mobile, etc.
That's how most blockchain clients work: on the client side, with APIs (usually RPCs) to talk to other programs.
> Being immutable means that only new information is added, no in-place update ever happens, and nothing is ever deleted.
Immutability is exactly what blockchains were designed for.
> Syncable
Since blockchains are decentralized by nature, this also comes out of the box.
Of course if you choose a blockchain you have to deal with a host of other issues... performance, no SQL (in most clients), harder to debug, etc. But still a fun thought experiment
Blockchains are meant as a data store: archival and verifiable, but not easily searchable. Blockchains weren't designed with advanced query languages or query logic in mind.
It seems like the OP is after something like a blockchain with querying added. But I don't think blockchain syncing is designed well enough for client-side use, and the OP seems to want to use this database client-side in JavaScript. It would be possible to implement blockchain-style syncing with WebRTC, but WebRTC has some security risks for users (mostly IP leakage), so I'm not sure I would start with it as the primary communication channel.
Most blockchain clients have a DB with a cached version of the chain. This is typically an off-the-shelf DB: Bitcoin Core, for example, uses LevelDB, while other implementations use SQLite.
As I said at the end, blockchain isn't really what OP wants. It meets his requirements mostly via technicalities.
Hyperledger Fabric gets around this by storing a copy of the data up to the last block in some other DB, like CouchDB. That way you (supposedly) get the best of both worlds.