Databases we have today aren't designed for the kind of consumer applications we use today. They are designed for banks, Amazons and their smaller counterparts - where a central entity (the business or corp) needs full control over the data.
The most popular consumer apps (FB, LinkedIn, blogs, photo sharing, goal trackers, slack, basecamp etc) shouldn't be centralized in this fashion. Data(bases) should be owned by users, and synced with friends and colleagues. So instead of a single centralized database, we need a million tiny little databases which are synced with peers.
Tiny databases fit well into today's hardware capabilities landscape. We have plenty of fast storage, and powerful CPUs on computers and phones. It's time to stop building as if the server is the only place where queries can run efficiently. Truly serverless, that is. This also improves data security in general; there is no central place to hack.
Happy to talk about this in detail. Please feel free to email me.
>Databases we have today aren't designed for the kind of consumer applications we use today.
Yes they are? Relational databases are very good for consumer applications with relational data. NoSQL databases are good for everything else.
>Data(bases) should be owned by users, and synced with friends and colleagues.
Why? You use words like should - why SHOULD it be that way? Users are able to create / buy / rent infrastructure to manage their own data - and yet they don't ... why?
>Shouldn't be centralized in this fashion. Data(bases) should be owned by users, and synced with friends and colleagues. So instead of a single centralized database, we need a million tiny little databases which are synced with peers.
Nice in theory (maybe). The problem is that centralization provides enormous quality and performance benefits. Distributed systems are resilient, but slow, and require a lot of care to maintain consistency and integrity of the network. Take email as an example. It took us 20 years to get email spam somewhat under control, and a major part of that was creating a centralized infrastructure (from using centralized email providers to third-party whitelists and blacklists). It's gotten to the point where hosting your own mail server is a huge hassle and pretty much impossible for regular people. I spent about a week trying to figure out why our Office365-originating mail wasn't being received by one of our customers. It turned out, their mail provider matched a particular phrase in our automatically-inserted disclaimer and completely rejected the mail.
Decentralized systems also make controlling your data all but impossible. If I want to remove my data from Facebook, that's the only entity that I need to deal with and it's relatively easy for regulators to set sound policies in place. In your distributed system, once your data leaves your node, you've lost all control of it forever.
> Decentralized systems also make controlling your data all but impossible. If I want to remove my data from Facebook, that's the only entity that I need to deal with and it's relatively easy for regulators to set sound policies in place. In your distributed system, once your data leaves your node, you've lost all control of it forever.
That's already the case though. When I ask Facebook to delete my data they delete their copies of it. However, anyone who has downloaded photos, any advertisers who have scraped that data, any archival attempts, any search results, and a whole slew of other partial copies of that data still exist. It's almost better for data to be explicitly distributed to make it more obvious that it can't be easily removed.
> where a central entity (the business or corp) needs full control over the data.
Or where we want some central control. Until we solve decentralised spam, I really don't want to share databases directly with other users. If you can generate a new identity yourself and then inject your own content into other databases, that's a recipe for pain.
Federation kind of works, but I can't imagine a fully distributed slack for example. I'm on matrix and get the "elon musk wants to share bitcoin" images every day from telegram bridge.
Even in email, spam is still not really solved. I still get both false positives and false negatives.
> This also improves data security in general; there is no central place to hack.
I see your point here, but would also counter with: degraded security because we need N databases to patch every issue after the issue is public, rather than 1 database (potentially) before the issue is public.
> Even in email, spam is still not really solved. I still get both false positives and false negatives.
See my comment about Secure Scuttlebutt below. Email is a little different, in that complete strangers (and apps) should be able to reach you. However, that's not the case with FaceBook, Slack, Instagram etc. If you sync exclusively with people you know (or your organization) and their connections, spam won't exist by design.
What you gotta do is model online communities on the physical world. In meatspace, you choose the people you are accessible to. That'll work here as well.
> I see your point here, but would also counter with: degraded security because we need N databases to patch every issue after the issue is public, rather than 1 database (potentially) before the issue is public.
WASM-based db engine could be a good option, if the app is browser based. We could treat it as part of the app, and the standard security considerations apply.
> The evil CAP theorem enters the town here. Consumers want things to work fast, it doesn't fit the decentralized approach.
I think we have to distinguish 'decentralized' from 'distributed' in this case. In a decentralized approach we reduce (by many orders of magnitude) the size of data we're dealing with. Locally installed/browser-based applications only have to deal with data that's created by, or directed at that user (and locally stored).
Add: For a number of applications, hard consistency guarantees are overkill. You could totally live with eventual consistency (some don't even need that). In fact, other than e-com and banking I can't think of any of my daily-use apps needing such guarantees. And indeed, decentralized data storage is a bad fit for banking and commerce.
Network hops cost time regardless of payload size. Distributed consensus (which is what is required for a “decentralized” system) requires interaction between distinct nodes.
You should note most consensus algorithm designs start with sharing a single ‘register’ (or ‘value’).
Most banks were doing “decentralized” relaxed consensus from the beginning of “banking”. Local office accepts your deposit of your check. “Funds available next day”. Because by “next day” they have confirmation from counter party for the funds.
You are correct, but much of this can be alleviated by aggressive caching on the client devices and delayed (optimistic) updates. It doesn't apply to all problems in this space, though.
FB, LinkedIn and Slack are exceptionally happy that it is centralized; what do you mean “shouldn’t be centralized”? Who cares about what’s better for the user? He’s not the customer anyway.
All of these apps chose technology that was available to them when they started building. So the focus could be to enable building modern usecases on decentralized data platforms.
If one of them becomes the next FaceBook, personal ownership could become a baseline expectation for similar apps. Once data is owned by users, apps become a layer on top of data which can be substituted as needed.
Sure. But there are good examples of decentralization making spam difficult or impossible.
One such is Secure ScuttleButt (SSB). You sync with friends and their friends (and their friends maybe, based on configuration). Now people aren't generally friends with spammers so you don't see their feeds. And if they are, a sufficient number of your friends would be blocking them - once a person is blocked by a certain number of people, stop syncing with them.
I am not convinced that SSB is a solution for all types of apps, but for Social Networks it works darn well. Oh, and zero ads too because you'd never sync with ad networks.
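Roughly, the sync rule could look like this (a toy Python sketch of the threshold idea, all names made up; real SSB replication is more involved):

    BLOCK_THRESHOLD = 3

    def feeds_to_sync(my_friends, follows, blocks):
        """my_friends: ids I follow directly; follows/blocks: {peer_id: set(peer_ids)}."""
        candidates = set(my_friends)
        for friend in my_friends:                    # friends-of-friends get pulled in too
            candidates |= follows.get(friend, set())
        block_counts = {}
        for friend in my_friends:
            for blocked in blocks.get(friend, set()):
                block_counts[blocked] = block_counts.get(blocked, 0) + 1
        # once enough of my friends block someone, I simply stop replicating their feed
        return {p for p in candidates if block_counts.get(p, 0) < BLOCK_THRESHOLD}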
I don't know. Facebook can spot fake profiles because of the centralisation, not sure how that'd be avoided in that system. Also, if users are really anonymous, the system will end up turning into a cesspool of disgusting and illegal stuff. At least that was the fate of all peer to peer social networks I'm aware of.
That would be handled client-side, like email. You could build in a standard set of rules, and then have some kind of ML process profiles you manually mark as spam. Just have it filter profiles in API requests through the fake profile detector.
> At least that was the fate of all peer to peer social networks I'm aware of.
That doesn't seem true of Mastodon. The few servers I've been on were fairly tame. I think the worst I saw was someone posted a sexualized anime-style image, but there wasn't any nudity.
I didn't immediately spot anything weird on Diaspora either, but I only spent like 45 minutes on there.
> Also, if users are really anonymous, the system will end up turning into a cesspool of disgusting and illegal stuff.
Anonymity can be okay, but only in the presence of heavy moderation. HN is basically anonymous if you want it to be, but it stays on target because of the moderation. You can see it on Reddit too. AskHistorians is a wonderful place to be; the subreddits without even basic moderation do indeed turn into cesspools.
Federation has historically made this difficult because you can only moderate your own instance, which is often a minority of the content you consume. There are potential solutions for that, though. You could allow users to "flag" posts, and add those flags to the data you federate. Then, like the fake profile system, you could filter them based on user parameters. I.e. hide posts that have at least 1 "racism" flag.
If necessary to prevent brigades, you could also filter the flags that you accept based on the user that flagged the post. Maybe weight them based on similarities in accounts you follow.
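As a toy sketch of how that client-side filtering might look (Python, with made-up flag names and a hypothetical trusted_weight function; not any existing protocol):

    HIDE_RULES = {"racism": 1.0, "spam": 3.0}    # hide on 1 weighted racism flag, 3 spam flags

    def is_visible(post, trusted_weight):
        """post["flags"] is a list of (flag_type, flagger_id) pairs that travel with the
        federated post; trusted_weight(flagger_id) -> float lets a client discount
        unknown flaggers and up-weight accounts similar to the ones it follows."""
        totals = {}
        for flag_type, flagger_id in post.get("flags", []):
            totals[flag_type] = totals.get(flag_type, 0.0) + trusted_weight(flagger_id)
        return not any(totals.get(ft, 0.0) >= limit for ft, limit in HIDE_RULES.items())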
I have to admit I haven't looked into p2p social networks in years. The ones I tried in the early 2000s (such as freenet) had really dodgy users. Perhaps a lot has been learned since then.
Anyway, I agree with you that what really prevents people from using the system for nastiness is moderation and not necessarily identity. However, the problem is not only fringe and extremist beliefs, but actually becoming a platform for actual crime, through sharing of photos and videos of abuse. That's what worries me the most about p2p social networks and the reason I'm generally not ok with storing other people's data in my computer, even if encrypted.
A lot of facebook's value prop for users is in what you're saying - not as a social media company, but as a centralized identity provider for the social graph.
Perhaps a decentralized solution could still use a centralized identity provider to help with things like spam & impersonation/fake profiles and prove who they are, while still allowing users to own the information _about_ themselves.
The problem here is incentives; there's not much money in being a dumb identity resolver unless you're harvesting everyones data AFAICT, so I sort of doubt a private consumer solution will emerge in the near future.
It'll never happen (in the United States at least), but this sounds like a service that should be run by the local government, or perhaps something more global. People on HN often talk of making the internet an actual public utility; perhaps something like centralized online identity would follow a move like that.
yes absolutely. In my scifi utopia, the government provides a digital identity to each citizen as a social service :)
One of the downsides of this is that it actually degrades privacy by improving private institutions' (and the government's) ability to track you.
One way I've thought of to get around this is to allow people to spawn new identities from their own root identity. This could allow a certain amount of anonymity, while allowing institutions to filter identities by requiring a certain amount of social proof (e.g. length of time since identity has been created).
Some services might ban non-root identities (e.g. banks), while others (like a bar) just need to be able to trust that you have a valid ID that can be used to prove you're over a certain biological age.
In the event that you need to prove you own a specific ID, there could be a way that given a root ID and a non-root ID, you re-generate the non-root ID from the root - thus proving you own it.
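A naive sketch of that derivation (Python; the scheme and names are made up, and a real system would use proper key derivation and signatures rather than bare hashes):

    import hashlib, hmac

    def derive_identity(root_secret: bytes, label: str) -> str:
        # child secret is deterministically derived from the root, per label/context
        child_secret = hmac.new(root_secret, label.encode(), hashlib.sha256).digest()
        return hashlib.sha256(child_secret).hexdigest()   # the public-facing non-root ID

    root = b"my-root-identity-secret"
    bar_id = derive_identity(root, "neighbourhood-bar")

    # proving ownership later: re-derive from the root and the label, then compare
    assert derive_identity(root, "neighbourhood-bar") == bar_id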
I think it’s a naïve position. Mail was decentralized. Now it’s mostly federated among 4-5 big providers.
Historically, people value low cost and convenience over any other feature. Unless there is some GDPR like directive requiring decentralization and self hosting, it will likely stay centralized and become even more so.
There are so many more email providers being used worldwide. Who are the 4 or 5 big providers everyone is using? gmail, yahoo, hotmail, fastmail and protonmail? Don't forget outlook and zoho
Many people still use aol, or the email given by their ISP, or a work email domain.
Worldwide yandex.mail is popular.
Colleges still give email addresses.
People like myself run their own
Vanity domains are popular.
Mail is still decentralized and news of its death has been greatly exaggerated
Google, Fastmail, Microsoft(Outlook/Office365/Hotmail/Live) already cover something like 90% of one side of emails, and 50% of both sides (that is, 50% of emails originate AND terminate among these three, and 90% originate OR terminate among them). I haven't seen any recent stats, but in 2017, 50% of emails already originated or terminated with Google, one provider.
The significance of "originate OR terminate" is that the provider has the ability to meddle or outright ban, either going out or going in.
> work email domain.
> Vanity domains are popular.
> Colleges still give email addresses.
Most universities, workplaces and vanity domains are running within the Google+Fastmail+Microsoft sphere, regardless of the domain name. Every university I've worked with in the last 10 years (e.g. NYU) was using a G-Suite or Microsoft hosted solution, as was every single workplace.
> People like myself run their own
There is a very long tail for the last 1%, but likely 99% of the users are served by 10 companies worldwide. I call that federated, not distributed.
I used to run my own server as well, for ~15 years. Until one day, google stopped delivering my mail to gmail or g-suite addresses (sent it to spam). There was no one to talk to to figure out why. That turned out to be ~70% of the email I sent -- despite having SPF, DKIM, DMARC and everything else properly set up. Eventually, I switched to fastmail hosting because I have better things to do with my time than figure out why Google is blacklisting me.
It's federated, not distributed, and we're all at the mercy of the big few.
Most of the responses to your post are critical, but I couldn't agree with you more. Having a local database would solve all kinds of performance, connectivity and privacy problems. Transparent background synchronization would be beautiful.
Unfortunately, this is a really hard problem to solve. Sync is a bear of a problem in the face of conflicts. An immutable datastore with a log file of all mutations would help a lot, but you'll still have to manage the conflicts. This is not only a back end problem. What kind of user interface do you present that actually makes sense to an untrained user? "Bill changed Mary's name to Marybeth, but Fred changed it to Mary Beth. Which is correct?" Most users will say "Whaaaa? Why are you bothering me?"
> The most popular consumer apps (FB, LinkedIn, blogs, photo sharing, goal trackers, slack, basecamp etc) shouldn't be centralized in this fashion. Data(bases) should be owned by users, and synced with friends and colleagues. So instead of a single centralized database, we need a million tiny little databases which are synced with peers.
> Tiny databases fit well into today's hardware capabilities landscape. We have plenty of fast storage, and powerful CPUs on computers and phones. It's time to stop building as if the server is the only place where queries can run efficiently. Truly serverless, that is. This also improves data security in general; there is no central place to hack.
Sounds like you're describing the MetaCurrency Project's Holographic chain pattern that is currently being developed:
Gitsta is work in progress; nothing to show yet, sorry.
I had looked at holochain earlier, but to be honest I found their literature a bit difficult to grasp (with a quick read). But I was aware though that they're trying to solve the same problem.
My view is that once you are able to reduce the amount of data an app has to deal with, you might even be able to rely on plain text files, yaml collections etc. for data storage/exchange; and fed into a db engine for faster querying.
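Something like this, as a rough sketch (Python with PyYAML and SQLite; the data is made up): the plain files stay the source of truth, and the database is just a throwaway query engine loaded from them.

    import sqlite3
    import yaml   # PyYAML

    doc = "[{name: Mary, team: ops}, {name: Fred, team: dev}, {name: Bill, team: dev}]"
    records = yaml.safe_load(doc)            # in practice: a file you keep in git/Dropbox
    con = sqlite3.connect(":memory:")        # disposable; rebuild it whenever the files change
    con.execute("CREATE TABLE people (name TEXT, team TEXT)")
    con.executemany("INSERT INTO people VALUES (:name, :team)", records)
    print(con.execute("SELECT team, count(*) FROM people GROUP BY team").fetchall())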
> I had looked at holochain earlier, but to be honest I found their literature a bit difficult to grasp (with a quick read). But I was aware though that they're trying to solve the same problem.
I posted your feedback into the Holochain forum, and used it as an entryway into asking some questions I am grappling with:
Do I have your permission to link it to our comments here?
> My view is that once you are able to reduce the amount of data an app has to deal with, you might even be able to rely on plain text files, yaml collections etc. for data storage/exchange; and fed into a db engine for faster querying.
If I understand you correctly then I think this is something Holochain's developers are tackling too. I think for the purpose you describe, they put a big focus on relationships and p2p actions that are 'gossip-ed' (spread) through a content addressable distributed hash table (if me mentioning this strategy seems unrelated, it shows the limitations of my technical competences).
-
If you're interested to engage, I'd love to know what things you found difficult to grasp in your quick read? This is a big subject with a lot of entry points, so my question is a broad question, on purpose. I'm interested in which functional use cases you are seeking/developing, and whether your critique of the current centralized architectures is similar to Arthur and Eric's critiques (the two Holochain originators), versus whether we have reached - and agree on - the same conclusions. The first one (critique) being more important to me: do we see the same challenges/shortcomings? I hope that makes sense.
From my understanding Holochain's main thesis is that centralized software obscures 1) an app's inner logic/workings, and 2) the relationships between peers and the data they create (as well as the shape and size of the data they generate).
Through my own research, the overarching pattern I found is that corporations monopolize blueprints. Through the use of centralized black box software, they can also act as middlemen, extracting parasitic rents. They can control and manipulate us by tracking our behavior and steering us in the direction their algorithms (and often the profit motive, e.g. advertising) have decided for us.
Society's innermost services (like money, law, identity, intellectual property databases, healthcare, land registry) are often invisible or semi-invisible architectures (they require a high clearance level, or the information is stuck in old proprietary formats that make them expensive and difficult to maintain). Because these services are still centralized, they evolve very slowly, meaning that their use remains limited.
A big part of my interest in Holochain comes from the work being done by the Mikorizal.org team (Bob and Lynn, and Pospi) on Open Value Network economic protocols. Using Holochain they are building a Resource Event Agent (REA) accounting model that is completely P2P. Their project aims to provide an alternative to the closed corporate Enterprise Resource Planning (ERP) software that is the norm today. I think this new paradigm can help us see more clearly the actual scarcity of natural resources, and serve as a tool to cooperatively fund and develop open source hardware (and software) in the open.
I mean public insofar as it just happens without you thinking about it; that could be commercial too, it's just more pervasive as a public service, when such infrastructure becomes a _guaranteed_ aspect of a given operating system. But until then, people implicitly _want_ centralised.
One healthy middle ground here is the situation with git. Git is fundamentally decentralized, but I can use it in a centralised way if I want to via services like github, gitlab, bitbucket, etc. Those services allow me to collaborate with other users and back up my projects. And because it's decentralised, I can trivially move / clone my repository to other providers (or to my own servers) without losing any data. (... Well, except for github issues, which arguably should be stored in the repository itself.)
I can imagine a similar situation for a small database too. Live clone to Dropbox or something similar. The user gets all the benefits of a centralised system but stays in control, and they can switch providers whenever they want.
Yeah, but you know in practice that means at least 95% of users are on the centralised version. We continue to build up this 5% technocracy that has different rules applied to it because they know enough to care.
I think it is a common misconception that banks use database transactions the old fashioned way.
They seem to be more like eventual consistency. I initiate a transfer and there is some check about available balance (although with various overdrafts, etc. it is unclear how strict), then it is marked pending and eventually the transaction succeeds or fails days later (an international transaction).
This is conflating things. At a database level it is absolutely ACID compliant, full transactions are used for situations like you mention within one bank.
The fact there may be interplay between custodian banks and various triggered events has nothing to do with the type of database consistency features of an RDBMS.
Yeah, it has been around forever. Ericsson or some Nordic company uses it for phone switches. It is designed for concurrency. Personally I think pretty much everything has already been invented; it is all about people finding it and bringing it into fashion.
I actually feel like Fossil could be adapted to do this fairly easily. It uses Sqlite (it's made by the author of sqlite) so it has the same downsides described in the article, but if you could solve those problems I think it does pretty much exactly what you and the author are describing.
I have long wanted and needed this type of database, and the time is ripe now for someone to implement it. It should be local first and networked/distributed second, since IMHO the latter is a sync problem, not a database problem per se.
The closest to this idea is SQLite (open source and mentioned in the article) and SQL Anywhere from SAP (commercial and not mentioned in the article). But as the article pointed out most of the current solutions rely on SQL.
I'd love the new database system to be open source and use modern open data formats, for example TileDB, Arrow and Parquet. The computation part should be language agnostic and independent of the programming language - something like JuliaDB, though personally I'd prefer the D programming language.
Coincidentally, around seven years ago Bryan from Joyent (Joyent's presentation is mentioned as the main motivation early in the article) blogged about a similar effort targeting POSIX systems (specifically ZFS) that eventually became Manta and is now part of Triton a.k.a. SmartDataCenter [1]. For implementation details please refer to the original paper [2].
That depends on the programming language; for example, C# with LINQ and functional programming languages are pretty good at expressing queries. SQL has no capability for abstraction, so only the simplest queries will be succinct; if you want to e.g. union two similar queries then you'll have to copy and paste.
C# LINQ is basically very roundabout way of doing what SQL was made for. What SQL does, and does really well, is operate on sets of data (aka tables) using set theory. You can get succinct expression with C# and functional languages, but they are not really possible to optimize the same way SQL is because it's not bound by set theory the same way.
SQL by default operates on multisets instead of sets; it has no syntax for a set (aka table) literal, SELECT 1, "a" FROM DUAL UNION SELECT 2, "b" UNION ... doesn't count; sub-queries (using another SELECT in a FROM clause) are a pretty recent addition; and the list goes on. It's not based on set theory, or on first-order logic, or the relational model, not really. If it were, it would be Prolog.
Regarding table literals, some products support using
VALUES (1, 'a'), (2, 'b'), ...
as a table constructor in general, not just in INSERT, and looking at SQL:92, SQL:99, and SQL:2003 it looks to me like this should be standard SQL. Derived tables, aka subqueries, exist in SQL:92 at least as well, so whether or not that is considered recent depends on how you look at it, I think.
I am not trying to defend SQL with this, and all in all this does not take away from the points you raised, but the above were something that stood out.
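For what it's worth, the CTE spelling of a literal table already runs on stock SQLite (a small sketch, with Python's sqlite3 just as the driver):

    import sqlite3

    con = sqlite3.connect(":memory:")
    # a literal table built with VALUES, no INSERT involved
    rows = con.execute("""
        WITH t(id, label) AS (VALUES (1, 'a'), (2, 'b'), (3, 'c'))
        SELECT * FROM t WHERE id > 1
    """).fetchall()
    print(rows)   # [(2, 'b'), (3, 'c')]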
They were in SQL:92, but IIRC adoption was somewhat slow and patchy, and generally finished sometime in the early 2000s.
My point is, the parent claims that SQL "operate[s] on sets of data (aka tables) using set theory" and does it "really well", but when you actually look at SQL, you realise that sets/tables aren't really first-class - derived tables were only added in later versions of SQL, and literal tables still don't exist - even though those are things you'd expect from a language focused on table manipulation. Nope, it's a language for building very specific kinds of queries which was then patched and extended into something more general.
I find most strongly typed functional languages get too hung up on the structure of the rows. E.g. what happens to SELECT user_name, price FROM orders JOIN users USING (user_id)? You might end up with something like this:
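Something along these lines - a rough Python sketch with made-up Order/User row types standing in for the typed-functional version:

    from typing import List, NamedTuple

    class User(NamedTuple):
        user_id: int
        user_name: str

    class Order(NamedTuple):
        order_id: int
        user_id: int
        price: float

    class UserNamePrice(NamedTuple):   # a bespoke row type for this one projection
        user_name: str
        price: float

    def orders_with_user_names(orders: List[Order], users: List[User]) -> List[UserNamePrice]:
        users_by_id = {u.user_id: u for u in users}   # the join order is spelled out by hand
        return [UserNamePrice(users_by_id[o.user_id].user_name, o.price) for o in orders]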
But that will quickly get messy if you have multiple tables, and you have to explicitly say how to order joins (which you don’t do in sql), and something which just analyses the function calls (ie can’t look inside the lambdas) can’t really know that it only needs to look at a few columns of the tables. So maybe instead you need special values to represent the columns of your tables but then how can you represent the necessary renaming when you join one table multiple times? And how will you describe computations with those fields?
Ultimately I think this is a problem largely involving bindings and sets of things and that these are difficult (if not impossible) to move to the type level while providing a nice api. So you could either get a risky api that can throw runtime errors for queries that could never be valid or you get issues like the ones I described above.
Datasets there work like [(column_name, value)] heterogeneous lists would work in Haskell. It really does not get that problematic. (All the problems go away in Haskell if you decide to use maps, too.)
But yes, it throws runtime errors. The nature of retrieving data from a foreign service is that it will throw runtime errors. There is some verification you can do, but a strongly typed database layer just makes promises it cannot fulfill.
With Common Table Expressions you can process the shared data in one set ("temporary table" is a good mental model for it), compute the unique parts in subsequent sets and use the results for the UNION in a single query.
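A tiny sketch of that (SQLite through Python's sqlite3, made-up orders table): the shared part is written once in the CTE, and the unioned branches only differ in their filter.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, total REAL, status TEXT);
    INSERT INTO orders VALUES (1, 'ada', 120, 'paid'), (2, 'bob', 80, 'refunded'), (3, 'ada', 20, 'paid');
    """)
    rows = con.execute("""
        WITH big_orders AS (
            SELECT customer, total, status FROM orders WHERE total >= 50
        )
        SELECT 'paid' AS bucket, customer, total FROM big_orders WHERE status = 'paid'
        UNION ALL
        SELECT 'refunded', customer, total FROM big_orders WHERE status = 'refunded'
    """).fetchall()
    print(rows)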
Linq can be used in a declarative way such that the AST is available and can be rewritten for optimization or transformation say to SQL [1].
Not that this is easy, but many projects utilize it, and the relinq project tries to give you a more usable starting point [2].
After having done SQL for years and then Linq, I actually prefer the more explicit operation of Linq: by default, I know the order of execution is the one I specify. Also, being able to use a full, rich imperative language in the query is so very useful.
I experimented with various approaches to client-side data handling and ended up realizing that I do not want a client-side database at all. I'd much rather hold data in language-native data structures (in my case, Clojure/ClojureScript). These are very powerful and there is zero impedance mismatch: after all, they are a part of the language.
Yes, you do need queries, but there are not many and they might as well be coded as functions. You can do all the indexing you want, if you need to, but JavaScript in browsers is so incredibly fast these days, that even a linear search works well for most cases.
Unless you are handling gigabytes of data client-side, of course. But that wouldn't make much sense.
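A rough sketch of the same idea in Python rather than ClojureScript (shapes made up): plain data structures, "queries" as ordinary functions, and a dict doubling as an index.

    users = [
        {"id": 1, "name": "Mary", "team": "ops"},
        {"id": 2, "name": "Fred", "team": "dev"},
    ]
    users_by_id = {u["id"]: u for u in users}   # build an index only where lookups need one

    def users_in_team(team):
        # a linear scan is fine for the amount of data a single client actually holds
        return [u for u in users if u["team"] == team]

    print(users_by_id[2])
    print(users_in_team("ops"))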
Data modification needs to go through a server API anyway, because the client environment can't be trusted, so two-way sync would be useless. I want the server to perform data model operations while verifying everything, I do not want data sync.
>I'd much rather hold data in language-native data structures (in my case, Clojure/ClojureScript). These are very powerful and there is zero impedance mismatch: after all, they are a part of the language.
>Yes, you do need queries, but there are not many and they might as well be coded as functions. You can do all the indexing you want, if you need to, but JavaScript in browsers is so incredibly fast these days, that even a linear search works well for most cases.
That's just a bad database. It will work for small amounts of data and simple queries. But add in something complex or lots of data and it will fall over.
In theory, if it's all in a single application's memory and using language-native data structures, then you have a lot of power to tailor the data model to meet your application's needs. Barring concerns like object overhead (which I realize is a major consideration in languages like JavaScript and Clojure), you should be able to come up with something that is more efficient than a general-purpose solution like a database without too much trouble. It's really just a special case of the Chuck Moore style, "I coded my own $thing in 1/10 the code and 10x the performance of the popular off-the-shelf option, in a week. NBD, it's easy if you simply choose not to implement any of the features you don't need." thing.
Where DBMSes truly shine is working efficiently with data that you can't or don't want to keep in memory all at once, or for which lots of different clients need concurrent access. The first two of those (but hopefully not the third) are absolutely considerations for certain kinds of browser apps, but maybe not something the parent was worrying about.
I've long thought there's a potentially-useful/missing piece of software infrastructure that's midway between code or data-written-as-code and a full blown database (even one as small as Sqlite).
I have to question the "relational" part. SQL is an absolutely massive language, to the point where "we support SQL" is almost an empty statement because there's no single SQL to support.
While revamping the database, consider a smaller query language. I built a GraphQL-based database query language in my GraphQL compiler project[1], and it's been a huge success at my company so far[2][3]. Here's a demo of cross-database querying with it, if you're curious [4].
I'm building a relational lang (http://tablam.org) so this is close to my interests.
SQL is not just massive, it has grown past its actual capabilities and original purposes without being fixed or cleaned up, and then it ALSO needs to be the interface for some of the most complex pieces of software (RDBMSs) and support the needs of a multitude of features, frameworks and languages. It's like html/css/js.
Going back to basics, the relational model is fairly simple, and you can spice it up with functional and/or procedural features to round it out and still keep it very small.
So yes, I think it can be done MUCH better than SQL and still do MUCH more than it can right now...
P.S.: My language TablaM will (eventually!) be basically this, plus a little array programming. You can fit the relational model very well into a lisp/tcl/datalog-like environment with a small core that is very expressive...
In fact "TablaM" come to sound like "Kablam!" https://www.youtube.com/watch?v=9glnVMq1xrE, a nickelodeon weird program of the past (when it do a lot of cool things). I have a habit of put weird names to my projects...
Then I morphed it into TablaM, and later in the game I created a pseudo-plausible rationalization where "Tabla" (Spanish) == "Table" and the M is for "Multi-paradigm, -platform, -dimension, etc." - that's the explanation I will use in case serious people ask me :)
I'm not really an expert on the field. But, have you considered that maybe SQL is massive for good reasons?
The subset of SQL that covers the majority of use cases is actually small and probably mostly covered by the standard.
All the rest of SQL, which makes it a massive beast, is for all those cases where adding the logic to the database is the most efficient option, for those who need that extra bit of efficiency.
I think SQL is massive because of two major reasons:
1. It has a highly inconsistent, non-generalized syntax
2. It tries to reuse the language for everything -- from schema specification, to querying, to administration, to query engine hints, to procedural language definitions
The RDBMS in total is a highly complex system, but SQL adds a lot of accidental complexity on top of it.
Also, for a new DB you hardly need to support the whole suite of features to produce a useful system -- you just need to constrain the recommended use cases. See SQLite for the quintessential example -- they dropped MVCC and a lot of management/multi-tenancy features as they targeted a surprisingly large niche.
As it turns out, there's a wide variety of data management needs -- SQL is often a beast because most RDBMSs have a goal of targeting all of them.
It's slightly more likely that modern SQL is the byproduct of several query language specifications accelerated to ludicrous speeds and set to collide with each other. Out one end of the apparatus, SQL-86 emerged.
A little known fact about this experiment: the residue splattered along the interior of the apparatus was recovered and named LDAP.
Why is it 'massive'? I don't think it is. Compared to say C++, or programming in python where the language is smallish but the libraries considerable, I can't see why you say that.
I can’t help but be extremely pessimistic. This is an insanely challenging project. Looking at the author’s other side projects: https://euandre.org/about.html
I’d guess this project is about 10,000x-100,000x as challenging as the Songbooks one. And the author has no obvious experience writing DBs, that I can see?
This project seems at least as ambitious as CockroachDB. Development started on CockroachDB 6.5 years ago, was founded by a group of Google alums with relevant distributed systems experience (working on GFS), and it since became well funded and has had a sizeable team working on it full time for years. This project would be almost impossible to make significant progress against without a strong, well funded team working on it for years, but the author speaks about it as a side project.
What does "10,000x-100,000x as challenging" mean? If the songbook cost him one week, this will cost him at least 200 years? I love a good hyperbole, but this is just disparaging.
Most databases that we use are written by hundreds to thousands of people and take years to become reliable and mostly bug-free. It can easily take much more than 200 programmer-years.
By that yardstick nobody should ever attempt to write an OS or programming language or game engine from scratch. I am glad there are plenty of individuals who try.
> you will never have enough time to write something comparable to modern Linux or Unreal alone
That is only true if you're retaining their respective requirements and architecture. Not necessarily a good idea.
Linux supports tons of devices and architectures very few people are using: people are only paid to add stuff, no one is paid to remove it once it's no longer needed, so it tends to accumulate. Writing an OS in C is not necessarily a good idea in 2020; if I were interested in writing a new OS, I would do it in .NET.
Unreal Engine is not that complicated in comparison to modern Linux. I think when people are watching modern video games or technical demos and saying "wow", they overestimate the complexity of the code, and underestimate the complexity of the assets.
Yeah, but be realistic. If you're just one dude with no experience in creating operating systems, then you shouldn't seriously try to write one from scratch.
For giggles, sure, go into the project with the idea that it will fail; that's not an issue. However, if you're serious, then what the poster is saying is that it's over a lifetime's worth of work.
Time to plug the osdev wiki, a very cool read even if you're not planning to write your own OS. Here's the beginner mistakes page: https://wiki.osdev.org/Beginner_Mistakes
The Linux that was created by one person is far from the Linux that exists today. The Linux that exists today was created by hundreds (thousands?) of people over the span of decades.
But they're usually started by one or a few people. If they get somewhere promising more resources can be added over time to make it more than a cool PoC.
Fair enough, pulled those numbers out of my a, and they are indeed too high. ~1,000x seems like a better incredibly rough guess. i.e. if the Songbook project is 1 dev, 1-2 weeks, this would be more like (extremely roughly) 2-4 years for a team of 10 devs.
I don't think so. For one, he wants it to run in the browser, and SQLite doesn't run in the browser. In his "embedded" section, he makes it pretty clear that he doesn't want to build on SQLite:
> SQLite is a great example of that: it is a very powerful relational database that runs almost anywhere. What I miss from it that SQLite doesn’t provide is the ability to run it on the browser
> Mentat was an interesting project, but its reliance on SQLite makes it inherit most of the downsides (and benefits too) of SQLite itself.
Also, I'm sceptical you could make SQLite efficient for an immutable DB, IF you want to support full SQL. A DB like Datomic has to do some pretty crazy stuff to be remotely efficient, and it doesn't even try to support full SQL - a great read on Datomic internals: https://tonsky.me/blog/unofficial-guide-to-datomic-internals
Mentat, the DB he links to, is much less ambitious than this proposal, and had a couple years of work by a number of devs before eventually being shuttered/abandoned. It's about 35K lines of Rust, despite being built on top of SQLite, isn't remotely done, doesn't run in the browser, etc.
Even if you did build this proposed DB on top of another DB, to make it efficient, and have all of the features discussed, I strongly suspect that it would be an absolutely massive amount of work.
CockroachDB sounds like it has a target market of a couple of hundred to low thousands companies though. How many people are in the market for a "[A] SQL database for building global, scalable cloud services that survive disasters." [0]? It sounds like a pretty good product, lots of potential, etc. But there are not that many people who need to run their own cloud and maintain uptime in the event of a data centre being taken out. If I ever use CockroachDB it is probably because some cloud provider decided to offer it and not Postgres. Those core features are targeted at problems that only a very small subset of people have.
Compare that to embedded/immutable/syncable/relational. Everybody in software has dealt with software where that would be a compelling feature set in a database. These are solutions to problems any developer has encountered. It is, in that sense, a much easier project for people to understand and get behind.
OP probably doesn't have useful opinions on how his software should handle a direct nuclear strike on his production machine. He probably does have useful opinions on when he should cheat and let a little bit of mutability into the otherwise 'immutable' system.
I've built out a journal layer on top of SQLite for a flash card app I'm working on. I wanted to be able to sync data across devices but I knew that I'd always just have to handle a single-user/multi-device scenario and never multi-user/multi-device. I was able to come up with a simple CRDT style system where I just log the modifications made per device and share them out with other devices via simple files (i.e. upload to central file store and download peer journals). Each device then has to reconstruct the state of the SQLite database entries using the shared journal files. Since it assumes single user, I can safely use "last write wins" as the conflict resolution strategy.
It's working reasonably well for my small use case, but there are definitely limitations that you have to be aware of. The big one is that, because it's journal-based, the journal can grow and grow. Since I use simple files as the protocol for sharing changes, this means uploading/downloading the same content more than I'd like. If I had a stream-based server that just handled the deltas, it would be better, but my goal with this project was to avoid having to roll my own service. (I wanted users to provide their own cloud storage, say iCloud or Dropbox, and the app would just use it as a central repo, meaning no fees for maintaining the service.)
There are tricks around these limitations that I can work on if I had the time (my main goal is to build the app and not a sync system). For example, I could cap the journal files to a certain size and create a new one when the limit is reached. This avoids having to upload the same contents too frequently. I could also have a strategy where I merge all journals into a central file every so often so that the state of the world is shared there instead of the journals (which can then be discarded).
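For anyone curious, here's a minimal sketch of the scheme as I read it (not the author's actual code): each device appends (timestamp, device, key, value) entries to its own journal, journals are swapped via dumb file storage, and every device replays the union with last-write-wins per key.

    def replay(journals):
        entries = [e for journal in journals for e in journal]
        entries.sort(key=lambda e: (e["ts"], e["device"]))   # device id breaks timestamp ties
        state = {}
        for e in entries:
            state[e["key"]] = e["value"]                     # last write wins
        return state

    phone  = [{"ts": 1, "device": "phone",  "key": "card:42.front", "value": "chat"},
              {"ts": 3, "device": "phone",  "key": "card:42.back",  "value": "le chat"}]
    laptop = [{"ts": 2, "device": "laptop", "key": "card:42.back",  "value": "the cat"}]

    print(replay([phone, laptop]))   # card:42.back resolves to "le chat", the later write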
Interesting! I wasn’t aware of that feature. I actually built out my idea first as its own toy database with journals and later added SQLite as the storage mechanism.
Patchset could work for this, but I’d have to play with it a bit. Off the top of my head, I’d need to redesign things to consider time in terms of sessions as opposed to granular changes. i.e. today I can just append deltas and use a time stamp to know how it fits chronologically with other changes in peers. A new peer created could sort all deltas to play back the database from scratch. I suppose with patchsets it could sort them all by start time and apply them one after the other to recreate the database.
It’s an interesting problem because you can’t assume to have synced recently, so you don’t know the baseline you are starting with, but for the application I’m building, that may not matter.
This sounds like a cool project.
Considering it's a journaling app and I'm assuming the user just enters plain text, have you experimented with compression and decompression before and after the SQLite storage?
It's actually a flash card app, not a journaling one, so the text entered is quite short. I haven't played around with the compression idea, but it sounds like it could be useful for large text entries.
Immutability is an interesting property, but reality is messy. Depending on the use case you envision for the database, data that needs to be scrubbed will accidentally be added to the system. This could be an important private key, or something legally dubious. In an append only/sync system there isn't really a good way to scrub that data. (Replay the entire history, filter bad data, then delete all remotes and resync from clean copy is not good). Again depends entirely on the use case of the database you envision, but if you wish to make it general purpose it is something to consider.
This article[1] from 2015 has stuck with me for years, talking about the same approach. The interesting thing about transactional/bitemporal databases is really the effect on how we consume them with our applications. The schema can be separated from the transaction data, schema migrations become easy, tables become cached materialized views (with the cache updated concurrently at write time), reduced need for application-level cache, no cache cold start, and pub/sub becomes the norm.
I’m the same regarding that article. I always thought of that article as a kind of “deconstruction” or “decoupling” and “rearrangement” of traditional DBMS architecture.
The end result you speak of is a more flexible data platform, which can handle more use cases, including applying decoupled parts to different things.
A good example, in my mind, is Apache Spark - which is like taking a compute and SQL query engine, making it distributed and decoupling it from the underlying storage. Now, suddenly SQL or code can be used on anything you can fit into a data frame, even streams - wow!
Pulling out the transaction log - and wow, suddenly we have durable, potentially immutable queues we can use for even long term storage.
I was so excited by the line of thinking in that article. :)
> Immutability is an interesting property, but reality is messy.
Words have fuzzy boundaries - immutable obviously doesn't mean actually immutable. It means immutable like how git is immutable - you're not meant to change history. It'll fight like a tiger to stop you easily changing history. It'll give you two different ways to get what you want while preserving history. But if you want to change history you can.
SQL is said to be declarative but if you get hit dealing with the wrong sort of workload in SQL then the implementation details can start to matter extremely quickly.
In practice, the author probably only means 'much more immutable' rather than 'illegal to use with European data' level immutability.
True. And what makes it easy is its immutability. When you change history in Git, you rewrite it (rebase). But re-writing is only possible if you know every bit of the journey up to now. When you have actual updates, history is lost.
The way git changes history is a stretch on the word "change": rebasing creates a parallel history and then promotes it to be the true one. The reflog still remembers the previous history.
It’s interesting because it reflects the reality of modern computing — for many use cases storage capacity is not a meaningful constraint.
I ran pretty important systems as a DBA years ago that were significant but would never, ever run into a meaningful constraint on growth. (Performance, different story.)
Sometimes looking at a problem like a database with a new set of constraints leads to interesting and novel things.
I've been working on personal and professional projects where growth isn't a truly restricting constraint but is still an actual constraint. What do you mean by "meaningful"? Once I get into the 10's of terabytes there's no storage system where copying a TiB of data is "instant" or ignorable.
Professionally I work on a financial reporting system where there are around 1.5 million accounts with trading activity. Things like moving a month of data around and checksumming still take on the order of 1 hour. If the size of that monthly dataset doubled, we'd be OKAY, but if it grew 10x we'd have to start restructuring things both on the software side and the hardware side.
For personal work I am working on searching video frame substrings, regardless of codec, resolution etc., within videos in a ~50 TiB video set. Having to re-encode hundreds of GiB of video at a time is something that takes a huge amount of time and not something I can just "throw into an S3 bucket". The growth of said working set is not truly constrained, but it is actually constrained. I can only afford so many 8-10 TiB HDDs at a time! They also fail and have to be mirrored/backed up with rsync at the very least, which adds to the cost for me.
I mean that it's not a technical constraint or limit. Not that it's trivial to do! If there is an ROI in keeping stuff forever, you can.
Think about what you’re saying about your personal project. What you’re talking about wasn’t financially achievable for TV companies just 20 years ago. Hell, anything older than a few days would be stored offline in some tape silo, or in a shelf. (And 10-12 people to attend to it!) Now you as an individual could afford to do all sorts of stuff with that video data, albeit for some significant $.
Let me disagree. I'm working on a transactional OS [1] with an immutable filesystem; it's primarily a website builder. Every request creates a new snapshot of your website. It's append-only for the duration of the transaction. That's the only way to provide robust rollback, and it also provides version snapshots for free - but it doesn't mean snapshots can't later be removed.
My solution was to provide an undo-chain of a fixed given length. Versions older than that simply "fall off the chain" and will be picked up by GC eventually. You can also make snapshots long-lived if you want.
(Of course removing a snapshot is an append-only operation :P but GC will go after non-referenced snapshots eventually.)
The problem of deleting stuff from a history of changes is challenging. But I think Darcs, the distributed version control system written in Haskell, does this rather neatly.
Darcs is different from git in that a "commit" (called a "patch" in Darcs lingo) does not depend upon the entire history of what has come before. Rather it has specified dependencies, which Darcs can infer itself – or you can specify if you want more semantic content in the patch structure.
A repository is a collection of patches, which are merged using something called patch algebra. So, if you want to delete a patch, you just issue the command "darcs obliterate" and it will remove the patch, and anything which depends on it. Then the working directory is reconstructed by merging together any loose ends using patch algebra.
With careful administration of dependencies, this approach makes deleting specific things very slick.
Database immutability does not mean it's impossible to delete content.
It means you separate two use cases: retracting information (I remove an item from my shopping cart because I want another item) and purging data (I delete my account and all my shopping history because I don't want the shop to store my data anymore).
You can still purge data easily in Datomic, for example.
I really don’t get why GDPR gets dragged up every time someone mentions immutable data stores; was everyone really building append-only systems before 2018?
It has been a legal requirement in the UK that databases must be able to delete personal data since the data protection act 1984 - https://www.legislation.gov.uk/ukpga/1984/35/enacted - in particular see the data protection principles in schedule 1: the 6th principle requires that data is retained no longer than necessary. There are provisions elsewhere in the act that can require a data holder to delete data.
Most of GDPR is not really new. There was a significant amount of teaching about data protection when I was at university in 1994-1997 and it wasn’t even new then.
About a couple years ago I used sql.js [1] in an internal tool to provide the user the ability to execute SQL queries against the data they had loaded. It worked just fine.
> test_onefile.c - This file implements a demonstration VFS named "fs" that shows how SQLite can be used on an embedded device that lacks a filesystem.
Especially an immutable database (like Git's object store) sounds like a very powerful concept. No more need for modification dates (they would be in the change record) and no need at all for separate log tables and all that guff. Every CRUD system ends up with some sort of audit log tacked on, costing heaps in maintenance, storage and extra processing for every change. In addition I would want there to be a command analogous to `vacuum` to truncate this history while the database is online, similar to Git's rebase or filter-branch.
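A rough sketch of what that buys you (SQLite through Python, made-up person table): the history table is the only table, the "current" view is derived from it, and the vacuum-like truncation is just a delete of superseded versions.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE person_history (
        id      INTEGER,
        name    TEXT,
        version INTEGER,
        changed TEXT DEFAULT (datetime('now'))   -- the "modification date" lives in the history itself
    );
    CREATE VIEW person AS
        SELECT id, name, changed FROM person_history p
        WHERE version = (SELECT max(version) FROM person_history WHERE id = p.id);
    """)
    con.execute("INSERT INTO person_history (id, name, version) VALUES (1, 'Mary', 1)")
    con.execute("INSERT INTO person_history (id, name, version) VALUES (1, 'Marybeth', 2)")
    print(con.execute("SELECT * FROM person").fetchall())

    # the vacuum-like truncation: drop every superseded version, keep only the newest
    con.execute("""
        DELETE FROM person_history
        WHERE version < (SELECT max(version) FROM person_history h
                         WHERE h.id = person_history.id)
    """)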
I already got one? My employer mostly has these requirements. We ended up writing our own query language because it is a graph DB, but all the bits are there.
LMDB + a journal + a query interpreter is really all you need. We don't have off-line mirroring yet, but that's on the list.
You could start with Synapse [] and rip out the bits you don't need. It is open source.
Yeah, I agree. At AppShare we take our server side databases and let them run in the browser. We even copy the data, so you can take a server side Sqlite database and save the Sqlite engine and all the data as an HTML file.
Having it on the browser level would have the huge advantage of the DB contents being visible to the users in devtools (not just to the developers via their custom tooling).
I don't think browsers have a good API it could store to, though. As far as I know there isn't really anything for storing efficiently in-place updateable/readable large binary blobs.
A basic implementation would use one document per page. An advanced integration might replace the whole sqlite storage engine with one based on IndexedDB.
I'd like to make it go away, but as a simple k-> v store it's decent enough. Web SQL provided relational data and aggregation, something that's sorely lacking from IndexedDB (and the libraries that build on top of it, poorly replicating SQL in the browser).
The amount of client side code one has to write to replicate simple SQL group by/sum operations is mind numbingly bad -- not to mention performance hit vs. SQLite's blazing native C performance (and it shipped with the browser to boot, no separate download of some lousy bloated 3rd party JS library).
I think DuckDB checks a number of these boxes! It is embedded and written in C++, so compilable to WASM. It is also roughly 10x faster than SQLite on analytical queries and interoperable with Apache Arrow! It might be a good place to start, anyway!
I think it's crazy that database updates are still acceptable in the design of systems and databases. Updates are deletes in practice, you lose what was there before.
Each record is stored in 2 * [# of nodes] * 3 + [backups] places. In our case that's 15 places before we talk actual backups (online, offline, and cold storage).
I'm actually a huge fan of immutable records conceptually, but I am unable to make a business case for it outside the most vital tables, that need amazing audit histories. The costs are just too high, and a lot of data not valuable enough.
I'll acknowledge that if engineering time was lower it would be an easier sell, but in large databases you're still talking about at least a quadrupling of space through every layer: What's the dollars and cents argument for that?
Most software isn’t FAANG. In the long tail of database size, I’d guess 95%+ of databases are smaller than a few hundred megs. And at that size, storage is cheap and storing a full historical record of changes should absolutely be the default.
Rich Hickey has a great talk on this a few years ago talking about the difference between Place and Value. He says accountants learned this lesson hundreds of years ago with bookkeeping. Users just don’t generate much data relative to the size of our disks. Well worth a watch, especially if you disagree:
> Most software isn’t FAANG. In the long tail of database size, I’d guess 95%+ of databases are smaller than a few hundred megs.
I think that statement is plainly untrue for most databases that generate enough value to pay for a developer working on them. I would bet it can be reversed: at least 95% of business-relevant databases are larger than a few hundred MB.
I have managed a small 200-player old-school browser game for a few years, and our database generates a few hundred megabytes per month. A major source of this data is a user trace-log ('what action they performed when', compacted as a timestamp and 2 integers), which we clear on every new round (about 2 months). Keeping every update (instead of the small trace log) would easily scale to gigabytes per month. And we're talking 200 users. I also worked for a small 150-employee web service company, where the development database snapshot was about 5 GB (a heavily trimmed down and anonymised version of the production db).
Now, cold storage is cheap. But fast-access database storage isn't the same thing as a cheap disk. That's why we keep hot data in memory, only write to disk when necessary, and back up to cheaper storage.
Lastly, a relational database's WAL is a complete historical record of all writes. It's enabled by default and in fact necessary for ACID and consistency guarantees. Keeping this log gives you a full history that allows you to restore the database to any given point in time without polluting the actual data with historical records. Granted, access to this data is much harder than a simple query.
There are also many other options to keep historical data rather than keeping it in the OLTP database, like OLAP data lakes.
So immutable records only make business sense in databases too small to benefit from immutable records (where you can just version the entire database, for a "few hundred megs")?
I don't understand what FAANG has to do with this. Medium or large non-tech companies commonly have business critical databases in the hundreds of gig range.
Acting like medium to large databases aren't common in business is just outright odd. The argument also doesn't address what I asked (business case for this vis-a-vis cost).
> So immutable records only makes business sense in databases too small to benefit from immutable records
My claim is that discarding historical data is an optimization. It's an optimization that should be off by default and turned on when needed. Archiving and compacting history should be something you only need to do when your database size gets out of control.
For small databases there's no reason to throw away historical records at all - and having an immutable log of records and updates should be the default.
Sure. Immutable doesn't mean you can't delete things:
> True. And what makes it easy is its immutability. When you change history in Git, you rewrite it (rebase). But rewriting is only possible if you know every bit of the journey up to now. When you have actual updates, history is lost.
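Roughly, as a sketch (the naming is mine, not any particular database's): a delete in an append-only world is just another appended record, and actually erasing history is a separate, explicit rewrite of the log, much like a rebase:

```typescript
// Append-only log in which a "delete" is itself an appended tombstone.
type Entry =
  | { kind: "put"; key: string; value: string; at: number }
  | { kind: "delete"; key: string; at: number };

const log: Entry[] = [];

function put(key: string, value: string): void {
  log.push({ kind: "put", key, value, at: Date.now() });
}

function del(key: string): void {
  log.push({ kind: "delete", key, at: Date.now() }); // history stays intact
}

// Current state is a fold over history; nothing is updated in place.
function currentState(): Map<string, string> {
  const state = new Map<string, string>();
  for (const e of log) {
    if (e.kind === "put") state.set(e.key, e.value);
    else state.delete(e.key);
  }
  return state;
}

// Truly erasing data (say, for a legal request) is an explicit rewrite
// of the log -- the moral equivalent of a rebase.
function compact(): Entry[] {
  return [...currentState()].map(
    ([key, value]): Entry => ({ kind: "put", key, value, at: Date.now() })
  );
}
```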
I've also wanted this. Operating on data locally means it can be fast and low-latency (on device, not between devices).
I think the secret lies in using CRDTs to sync data in a way that allows it to be processed into another system. E.g. Postgres on a server and SQLite on a phone.
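As a toy illustration of the sync primitive I mean (my own sketch, not a reference to any specific library): a last-writer-wins register is about the simplest CRDT there is, and because its merge is symmetric, the same logic can run against Postgres rows on the server and SQLite rows on the phone:

```typescript
// Last-writer-wins register, the simplest useful CRDT.
// Each replica (server or phone) keeps (value, timestamp, replicaId)
// per field and merges by taking the newest write, with replicaId as
// a deterministic tie-breaker.

interface LwwRegister<T> {
  value: T;
  timestamp: number;   // a hybrid logical clock in a real system
  replicaId: string;   // e.g. "server" or a device UUID
}

function merge<T>(a: LwwRegister<T>, b: LwwRegister<T>): LwwRegister<T> {
  if (a.timestamp !== b.timestamp) {
    return a.timestamp > b.timestamp ? a : b;
  }
  // Same timestamp: break the tie deterministically so every replica
  // converges on the same value.
  return a.replicaId > b.replicaId ? a : b;
}
```

Because merge() is commutative, associative and idempotent, replicas can exchange rows in any order, over any transport, and still converge.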
Noms has gone pretty quiet since the acquisition. Dolt is based on Noms (not sure how much it has diverged) and much more active: https://github.com/liquidata-inc/dolt
Upvote for Syncable & Embeddable. It will take a massive amount of time to build from the ground up, so I wonder if it would be sensible for you to build SQL on top of RocksDB like CockroachDB did: https://www.cockroachlabs.com/blog/cockroachdb-on-rocksd/
Aren't relational AND syncable very difficult to combine? Foreign key references, for example, may not yet exist on one instance when changes are synced. PouchDB/CouchDB and Git support multi-master sync, which is great, but they are not relational.
The workaround is to detect foreign key violations in your app with indexes or a job that runs on the nodes.
Log shipping works well for single master scenarios and these solutions can be combined with a partitioned primary key space for inserting new rows in the remote databases.
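One common way to do the "partitioned primary key space" part (a generic sketch, not tied to any particular product): make every key carry the node that minted it, so concurrent inserts on different replicas can never collide:

```typescript
// Partitioned key space: each node mints keys only from its own range,
// so rows inserted concurrently on different nodes never collide and
// can be shipped or merged without renumbering.

class KeyAllocator {
  private counter = 0;
  constructor(private readonly nodeId: number) {}

  // Composite key: high bits identify the node, low bits are a local counter.
  next(): bigint {
    this.counter += 1;
    return (BigInt(this.nodeId) << 32n) | BigInt(this.counter);
  }
}

const phone = new KeyAllocator(1);
const laptop = new KeyAllocator(2);

console.log(phone.next());  // 4294967297n (node 1, sequence 1)
console.log(laptop.next()); // 8589934593n (node 2, sequence 1)
```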
A two-way embedded/server relational sync is a much more difficult problem but it has been solved before. SQL Anywhere is a commercial relational database that has exceptional synchronization capabilities for fleets of mobile devices. Its MobiLink gateway synchronizes with heterogeneous DBMS servers like Oracle.
A new relational sync project should become familiar with MobiLink, in my opinion.
I think something even better than relational may be using OOP objects directly, via the system prevalence pattern.
If you're willing to model your "DB updates" using event sourcing, you could fork something like Prevayler to achieve your immutability needs while being able to serve as source of truth across multiple nodes.
It's very difficult to have traditional OOP with pervasive immutability, because it almost entirely removes the idea of a stable instance of an object. There is no single instance that you call methods on or send messages to. Every method call or message processing must produce a new instance.
More generally, encapsulation fails here. Immutability cannot be encapsulated behind the same interface as mutability. An immutable interface by necessity will look different from a mutable interface (at the very least methods that would normally return `void` in a mutable interface cannot ever exist in an immutable interface).
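To make the interface point concrete, here is a hypothetical counter, nothing database-specific:

```typescript
// A mutable interface: the "write" returns nothing, and the instance
// is the stable thing you keep sending messages to.
interface MutableCounter {
  increment(): void;
  value(): number;
}

// The immutable version of the same capability cannot hide behind that
// interface: every write has to hand back a new instance.
interface ImmutableCounter {
  increment(): ImmutableCounter;
  value(): number;
}

class Counter implements ImmutableCounter {
  constructor(private readonly n: number = 0) {}
  increment(): ImmutableCounter {
    return new Counter(this.n + 1);
  }
  value(): number {
    return this.n;
  }
}

// Callers must thread the new value through themselves; there is no
// single stable "place" to mutate.
const c = new Counter().increment().increment();
console.log(c.value()); // 2
```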
Scala is mostly immutable, and OOP+FP works very well in it.
Also, just because an object is immutable doesn't mean that every method is going to return a new instance of that object. For example, a 'sum' method on a list-of-integers object would return an integer, not a list.
As a general rule of thumb Scala is precisely OOP when it's mutable and FP when it's immutable. The fusion occurs exactly at the boundary between mutable and immutable code.
Yes if you have only read-only methods then immutable and mutable methods coincide, but that's precisely the place where mutability doesn't matter to begin with.
Eventually somewhere in your code you have to write something and that's where mutability both matters and results in different APIs.
However, you're right that my initial statement, "Every method call or message processing must produce a new instance", is way too strong. I should have said every write-based call (or message processing).
Good points. There's also the minority notion that you can use OOP where it makes sense design-wise, and procedural/functional/whatever where that makes sense. Of course, if code becomes too esoteric, you risk making a special snowflake no one else is willing to look at. That probably explains why people take an all-or-nothing approach in most cases.
My point then is to fully utilize OOP where applicable, to gain the advantages of orthogonal design. This extra design effort may take many attempts to get right, but it provides more flexibility and power than initial iterations do.
Gotta figure out what your needs are. For example, if browser and Node is a requirement, then you can look at RxDB, which uses Pouch and either IndexedDB or various file options.
If you want everything, it’s going to be big and not appropriate for web or mobile unless part of browsers or the OS.
Hi, I'm Mikeal, I created PouchDB. After a year or so I handed the project off to some great maintainers who made it much better, as I had grown a little skeptical of the replication model and wanted to pursue some alternatives.
It’s been about 10 years, much longer than I thought it would take, but I have a young project that finally realizes the ideas I had back then.
Sometime after PouchDB was created I realized that it just wasn't going to work to replicate a whole database to every client. In fact, even a view of the database wasn't going to work, because the developer isn't in a position to really understand the replication profile of every user on all of their devices; you need a model that has partial, or more accurately "selective", replication based on what the application accesses in real time.
I became convinced that the right primitives were already present in git: Merkle trees. Unfortunately, git did a very poor job of surfacing those primitives for general use, and I wasn't having much luck finding the right approach myself.
Shortly after joining Protocol Labs I realized they had already figured this out in a project called IPLD. Not long after that, I started leading the IPLD project/team and then putting together my ideal database whenever I found a free moment.
It's very young, with lots of missing features, and I'm still working on better data structures for indexing, but it is very much a database that replicates the way git does and approaches indexing over a primary store the way CouchDB does. And there's a lot more too.
With these primitives we can easily nest databases inside of other databases (and create unified indexes over them) and we can easily extend the data types in the database to user provided types. Using some of these features it already supports streams of binary data, databases in databases, and linking between pieces of data.
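For anyone who hasn't dug into the git/IPLD primitives, the core trick is small. A much-simplified sketch (real IPLD uses CIDs, codecs and multihashes; the helpers here are mine): store every value under the hash of its encoded bytes and make "links" just hashes, and nesting plus deduplicated replication fall out almost for free:

```typescript
// Simplified content-addressed block store in the spirit of git/IPLD.
// Illustrative only; real IPLD uses CIDs, codecs and multihashes.
import { createHash } from "crypto";

const blocks = new Map<string, string>(); // hash -> encoded block

function putBlock(value: unknown): string {
  const encoded = JSON.stringify(value);
  const hash = createHash("sha256").update(encoded).digest("hex");
  blocks.set(hash, encoded); // idempotent: same content, same address
  return hash;
}

function getBlock(hash: string): unknown {
  const encoded = blocks.get(hash);
  if (encoded === undefined) throw new Error(`missing block ${hash}`);
  return JSON.parse(encoded);
}

// "Linking" is just storing another block's hash, so a database can nest
// inside another database, and replication reduces to: send whichever
// blocks the other side's root hash cannot reach yet.
const photo = putBlock({ type: "photo", bytes: "..." });
const album = putBlock({ type: "album", items: [photo] });
const root  = putBlock({ type: "db", collections: { albums: [album] } });
```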
Let me know what you think, there’s still plenty of work to do, but it has 100% code coverage, so it does what it says it does :) https://github.com/mikeal/dagdb
Indeed, Irmin is a library for implementing data stores that can be freely cloned, updated and merged, following a model similar to git.
It separates the content from the actual store, so it can run just as well on a file system, in a git repository, in memory, or in the browser (see for instance https://github.com/talex5/irmin-js).
Redwood is a decentralized database providing a data model that's much like a Redux state tree. One of the fundamental design goals was to better harmonize how data is handled on frontend and backend layers.
It's also:
- syncable/offline-capable, thanks to a CRDT system
- able to serve as a standalone Git backend
- embeddable into Go applications
- immutable, in a blockchain-y sense, due to its append-only ledger
There are some demos in the Github repo:
- A real-time collaborative text editor
- A chat application
- Git integration
- A video streaming application (still improving performance here, but it's functional)
It's pretty alpha still, but feedback and contributions from anyone interested would be very welcome.
It's a super interesting set of requirements, and I've been mentally trying to wrestle with an extra one on top of that: selectively syncing data.
Let's say I've got three users of my bookkeeping app. Person A and Person B are in different companies, totally separate data. Person C is a bookkeeper and wants to work with both their sets of data.
It's a trivial example where you can just say "well you just keep Person A's and Person B's data separate and switch what the bookkeeper gets to see when they want to reconcile separate accounts".
I don't have any good solutions for it yet, but I'm enjoying exploring them.
I just want a database that takes no configuration options except for a list of peers and a disk partition to write to and provides strictly serializable transactions 100% of the time (modulo partitions).
I'd rather start with slow and correct, and implement hacks to improve performance once slow becomes "too slow". But instead, database engines are sold on benchmark performance, so out of the box they lose transactions they told you they committed, or read uncommitted writes, or all sorts of other crazy things. I don't want this unless I write code myself to do those things. (If I go through the API, it must be completely consistent. If I clone the drive of the database server and pick around at random disk blocks, sure, maybe I'll see some uncommitted writes.)
I really like FoundationDB. It still feels like that system where if it breaks you will be blamed aggressively for picking a weird thing, so I'd like to see some more success stories, I guess. But if I really care, maybe I should be the success story.
The Go API is also kind of weird, and conflicts with my general thought on Go code that every I/O operation should be cancelable with a context -- they really don't subscribe to that point of view.
Those are the main reasons why I don't use it... but I guess they have the fundamentals down (the "foundation" if you will), so it's probably worth powering through those nits and start using it.
I'm not sure what you mean by not as flexible as the relational model. I'll address SSB, as I'm more familiar with it, having built a CRDT system on top of it in ClojureScript.
Each message can be linked by its content-addressable hash. The engine allows Map/Filter/Reduce queries, which are not much different in capability from SQL queries; these can even be indexed to speed up views. The only real difference is that no schema is enforced on each message type, but it's not that difficult to enforce something like that at a higher layer when publishing entries.
If you think about it, storing JSON entries is a lot more flexible than table rows. Also, the content stored need not be JSON; it can be EDN, transit, or a reference to binary blobs.
> Each message can be linked by its content addressable hash.
What I had in mind is that, while (as you well said) you can link the data in a pub in any way you need, the distribution model is fixed. The only way to change that is to create more pubs.
I think the distribution model is an integral part of the data structure in a general p2p setting and needs to be customizable, within reason.
Thanks for your response, I find this stuff super interesting to watch evolve.
I think I understand a bit of your concern, although note that pubs aren't strictly required for replication in SSB; one can use onion routing, physical devices (sneakernets), local-area connections, etc. But I agree that there are more areas to improve, such as changing the network/capability keys on the fly from within a message to ratchet the distribution, or making references to content hashes within different network keys. I'm curious whether your work takes any of this into account? Thanks again.
Maybe one can learn from and avoid the biggest mistake of SQL: mixing commands and data. Maybe the command should be kept separate from the data it operates on.
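In other words, the injection problem: the command channel and the data channel are the same string. A rough sketch of the separation (a generic placeholder shape, not any specific driver's API):

```typescript
// Mixed commands and data: the value is spliced into the command, so
// data can change the command's meaning (SQL injection).
const name = "O'Brien'; DROP TABLE users; --";
const mixed = `SELECT * FROM users WHERE name = '${name}'`;

// Separated commands and data: the command is fixed and contains only
// placeholders; values travel alongside it, out of band.
// (Shape only -- real drivers expose their own parameter-binding APIs.)
interface Statement {
  command: string;       // never contains user-supplied values
  parameters: unknown[]; // values are sent separately
}

const separated: Statement = {
  command: "SELECT * FROM users WHERE name = ?",
  parameters: [name],
};

console.log(mixed);     // the command's meaning has been hijacked
console.log(separated); // the command's meaning cannot change
```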
Take a look at Syncbase (https://vanadium.github.io/syncbase/), it has some of these characteristics including secure p2p sync and a SQL-like query language. It's not under active development but might be a decent starting point to pick up from.
I totally agree and love this list, but I also wish for graph queries like OpenCypher. The graph community really needs something like that, because right now it's all either centralized, expensive and proprietary, or tied to the JVM. If the SQL standard evolved to include a MATCH statement and ASCII-art syntax for traversals, it would make JOINs so much easier.
Have you ever used a flat-file database? I envision the perfect database having a file-system API that automatically sorts records and can be edited as standard files, but stores data in something like JSON or YAML. That way it can be versioned and queried using standard file-system tools, or anything that follows the Unix Philosophy.
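Roughly the shape I have in mind, as a bare-bones sketch (one record per file; the directory layout and helper names are made up):

```typescript
// One-record-per-file store: each record is a small JSON document, so it
// diffs cleanly under version control and can be grepped, sorted and
// piped with ordinary Unix tools.
import { mkdirSync, readFileSync, readdirSync, writeFileSync } from "fs";
import { join } from "path";

const DIR = "records"; // hypothetical data directory
mkdirSync(DIR, { recursive: true });

function save(id: string, record: object): void {
  // Pretty-printed so line-oriented diffs stay readable.
  writeFileSync(join(DIR, `${id}.json`), JSON.stringify(record, null, 2));
}

function load(id: string): object {
  return JSON.parse(readFileSync(join(DIR, `${id}.json`), "utf8"));
}

function all(): object[] {
  return readdirSync(DIR)
    .filter((f) => f.endsWith(".json"))
    .map((f) => JSON.parse(readFileSync(join(DIR, f), "utf8")));
}
```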
I tried this for a small project once about 10 years ago, using svn as version control and putting a GUI around it. It was clunky, with no good way of doing uniqueness or foreign key constraints, and slowed way down after ~1000 records. I ended up moving to a regular database and implemented the versioning logic manually, which was pretty straightforward.
At first this irked me because the final solution didn't retain the full power of a VCS: no merges, no branches, etc. But what I finally came to terms with was that it didn't need to (YAGNI): it just needed historical records, and that was pretty easy. It ended up being much cleaner and faster that way.
I think that you'd also have to implement indexes to make that work, just like SQL databases do. Even with SQL, I ran into plenty of cases where indexes were necessary for decent speed. Flat files would just amplify that.
And, if you're trying to use the VCS to do merges, the indexes will constantly get messed up, so they'd have to be reindexed after every merge to be sure they were legit, and it still wouldn't ensure uniqueness... Ouch.
I have yet to see any GB/TB project that cannot be handled by a traditional RDBMS, so I don't wish for new databases, other than when I have to work without either Oracle or SQL Server and miss their productivity tooling.
With those priorities in mind I'd probably start with RocksDB and build a SQL engine on top. I'm really impressed with RocksDB's performance, space requirements and feature set.
TileDB Embedded (https://github.com/TileDB-Inc/TileDB) checks all four of these boxes. It is a universal storage engine based on dense and sparse multi-dimensional arrays. I explain each point below.
> embedded
TileDB Embedded [1] is an open source (MIT licensed) embeddable C++ library which exposes C and C++ APIs. We also built APIs for Python, R, C++, Java and Go, and integrations into MariaDB, PrestoDB, Spark, GDAL, PDAL and more. Via MariaDB we even have embeddable SQL[2]. TileDB abstracts the storage backends behind an extendible VFS class, and currently supports S3, GCS, Azure, HDFS, and local disk.
> immutable
TileDB is designed around immutable objects. Every write creates a new "fragment" using an MVCC model[3]. No file is ever updated in place. TileDB is designed to naturally handle the eventual consistency and constraints of cloud object stores. This allows for features like multi-reader/multi-writer support, time traveling, and update support, all within the storage engine and without any external orchestration.
> syncable
TileDB's MVCC approach to handling the eventual consistency of cloud object stores translates directly into its sync-ability. Every write operation in TileDB creates a new "fragment"[3]. A fragment is an immutable folder (or prefix, on object stores) that contains all the data from a single write session. Fragments are created with a timestamp and a UUID to ensure uniqueness. A fragment is ignored until a special `.ok` file shows up in a listing, which makes each write atomic. Incomplete fragments (without the `.ok` file) are gracefully ignored.
This means that every fragment is self-contained and syncable. The only requirement is that the special `.ok` file must show up only after the complete fragment folder has been synced. With cloud object stores this is not a problem, given their read-after-write consistency guarantees. For other systems, the synchronization can be managed to preserve this ordering.
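To sketch the commit sequence just described (illustrative only, not TileDB's actual API; the `ObjectStore` interface here is made up):

```typescript
// Each write session lands in its own timestamped, UUID-named fragment,
// and readers skip any fragment whose ".ok" marker hasn't appeared yet.
import { randomUUID } from "crypto";

interface ObjectStore {
  putObject(key: string, body: Uint8Array): Promise<void>;
  listKeys(prefix: string): Promise<string[]>;
}

async function writeFragment(
  store: ObjectStore,
  files: Map<string, Uint8Array>
): Promise<string> {
  const fragment = `fragments/${Date.now()}_${randomUUID()}`;
  // 1. Upload all of the fragment's contents first.
  for (const [name, body] of files) {
    await store.putObject(`${fragment}/${name}`, body);
  }
  // 2. Only then publish the ".ok" marker; until it exists, readers treat
  //    the fragment as incomplete and ignore it.
  await store.putObject(`${fragment}/.ok`, new Uint8Array());
  return fragment;
}

async function visibleFragments(store: ObjectStore): Promise<string[]> {
  const keys = await store.listKeys("fragments/");
  return keys
    .filter((k) => k.endsWith("/.ok"))
    .map((k) => k.slice(0, -"/.ok".length));
}
```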
> relational database
Tables can be easily modeled as multi-dimensional sparse arrays[4]. Through TileDB's integrations with MariaDB and PrestoDB (and Spark), you gain a full SQL interface, while being able to query the data directly via the language APIs without using SQL. For more robust features, such as foreign key enforcement, it is easy to use embeddable MariaDB to achieve this, while we work to push such features into the storage engine itself. We are always seeking feedback on which features are most used and requested to pushdown common operations to the storage engine.
Firebase (whether it be Firestore or realtime database) has almost none of the desired properties in the article. The only one it does have, as far as I can tell, is "syncable".
I'm not sure what exactly Scuttlebutt covers or does not cover, but an example real problematic situation that I have seen is like this:
1. The system among other things stores user-entered content;
2. User A has published sensitive personal data about person B (perhaps in a doxxing-like post, perhaps by making a fake profile with their name and data, perhaps including interesting images). To also bring other laws in other countries into play, let's assume that person B is underage (which is realistic and common; school kids do things like that to one another).
3. Person B requests that user A removes that content, and A refuses.
4. Person B escalates (the specific process does not matter much) finally resulting in you getting a legally binding request to remove that information about person B from your system/product.
So the key tech requirements include both breaking immutability and full centralised control; you can do decentralised systems for e.g. performance and bandwidth reasons, but fully decentralised, user-controlled censorship-resistant technologies can't be used because then you can't implement censorship, and you need that ability.
There doesn't even have to be a person B. Under the GDPR you have the right to request erasure, which means the party holding your data must erase it, whatever it may be, and it must be erased from backup systems too.
I believe decentralization is a cool concept from an engineering perspective, but it creates more problems than it solves. The current model of big, centralized services is kinda like democracy. Bad in a lot of ways, but everything else has even bigger problems, so that's what we're stuck with.
First, it's very hard for end users to actually connect and discover each other. NATs are everywhere, so direct connections are often not possible, and introducing a central server defeats the point. Most non-techies spend the majority of their time on mobile, so keeping connections alive in the background is also not an option. Everyone running their own server is not realistic either. It has been proven time and time again that, if people can choose between paying and getting something for free, they will choose the latter. Whether the ads are personalized or not (what some people call tracking) doesn't matter that much.
Monetization is hard with decentralized services, and money is very often required to make something with good enough UX for most people. If you don't collect a lot of data, you make ad fraud easier, so ads are not an option. People don't like to pay (see above), and payments are hard to enforce if you're not running the service, so charging is not an option either.
Making decentralized services is much harder than making centralized ones. It's easier to pretend everything in the world happens in order and accept events one by one than to deal with the messy reality of thousands of devices doing thousands of things at once, without a coordinating entity. Imagine one user kicking everyone out of a group while, in other parts of the network, he himself is being kicked out. What do you do when those parts reconnect? Who decides? Timestamps can't be trusted either, as they can be faked. Now you have consensus problems, you're dealing with Byzantine fault tolerance, and you're inventing a blockchain. A central, trusted server is easier to write and less prone to security exploits.
Storing everything on your own device isn't great either. If you have 100 friends on Facebook, and each of them has 1000 photos, you don't have enough space to store them all. Not storing them isn't an option either, as you might actually want some of them at some point. Asking the original poster doesn't work, as they might not have the app open at that moment and so are disconnected from the network. With something like a decentralized Amazon, you need search, and good, privacy-friendly, decentralized search is somewhere between hard and impossible.
If you can't have captchas, rate limits and a lot of data about your users, it's hard to prevent spam, spreading fake news, attacks, catfishing and similar behavior. Restricting your network to friends works, as long as you accept the fact that no stranger (i.e. at your new place of work) can ever find or contact you.
Even if you manage to overcome all those problems, there's still a pretty high risk that someone might do an embrace, extend, extinguish on your network with their proprietary app, kind of like Gmail is doing with email.
I want decentralization to work and Facebook to be gone just like everyone else, but, as sad as that is, it never will.
The solution should be a generic CRDT implementation, built around a change log with unique identifiers for each participant and operation. But we also need filesystem/operating-system support for merging; otherwise you will need a unique file for every operation to cope with merging, or you end up with app-specific merging and no data interoperability between apps.
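A sketch of what "unique identifiers for each participant and operation" could look like (my naming, not any particular CRDT library's): give every operation a (participant, counter) ID, so logs from different devices can be deduplicated and merged into one deterministic order no matter how they arrive:

```typescript
// Operation IDs of the form (participant, counter) make every change
// globally unique, so logs from different devices can be merged,
// deduplicated, and replayed in a single agreed-upon order.

interface Op {
  participant: string; // stable ID of the device or user making the change
  counter: number;     // per-participant sequence number
  payload: string;     // the change itself (application-specific)
}

function opKey(op: Op): string {
  return `${op.participant}:${op.counter}`;
}

function mergeLogs(a: Op[], b: Op[]): Op[] {
  const byKey = new Map<string, Op>();
  for (const op of [...a, ...b]) byKey.set(opKey(op), op); // dedupe
  // Deterministic total order: counter first, participant as tie-break.
  return [...byKey.values()].sort(
    (x, y) => x.counter - y.counter || x.participant.localeCompare(y.participant)
  );
}
```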
My "small" wish would be to have a local but replicated data format for a text file. So I can edit it in any editor while still retain the history and allow for collaboration from multiple devices. I'd love for it to be shared between devices using dropbox and iCloud depending of the user.
> An embedded, immutable, syncable relational database.
Sounds like he wants a blockchain?
A blockchain probably isn't really what OP wants. But funnily enough it does meet all those criteria.
> what I’m after is a database to be embedded on client-side applications itself, be it desktop, browser, mobile, etc.
That's how most blockchain clients work: on the client side, with APIs (usually RPCs) to talk to other programs.
> Being immutable means that only new information is added, no in-place update ever happens, and nothing is ever deleted.
Immutability is exactly what blockchains were designed for.
> Syncable
Since blockchains are decentralized by nature, this also comes out of the box.
Of course if you choose a blockchain you have to deal with a host of other issues... performance, no SQL (in most clients), harder to debug, etc. But still a fun thought experiment
Blockchains are meant as a data store: archival and verifiable, but not easily searchable. Blockchains weren't designed with advanced query languages or query logic in mind.
It seems like the OP is after something like a blockchain with querying added. But I don't think blockchain syncing is designed well enough for client-side use, and the OP seems to want to use this database client-side in JavaScript. It would be possible to implement blockchain-style syncing with WebRTC, but WebRTC has some security risks for users (mostly IP leakage), so I'm not sure I would start with it as the primary communication channel.
Most blockchain clients have a DB with a cached version of the chain. This is typically an off-the-shelf DB: Bitcoin Core, for example, uses LevelDB, while other implementations use SQLite.
As I said at the end, blockchain isn't really what OP wants. It meets his requirements mostly via technicalities.
Hyperledger Fabric gets around this by storing a copy of the data up to the last block in some other DB, like CouchDB. That way you (supposedly) get the best of both worlds.