So, I realize this project is early, but it would be EXTREMELY helpful to walk through someone's use case - like, who is the target here? A business analyst who iterates on cleaning / analyzing small Excel CSVs? Or someone else?
After watching the screencast, all I saw was a bunch of commands explained (I could have read the docs for that). Instead, I'd like to walk through a use case where this solves someone's problem.
He mentioned at least one. His friend went to a cabin where there was little to no internet connection, then updated the database on a local device. Later on, other database nodes would just pull in the updated data.
Seems like it is for personal use. Maybe something to build apps on top of.
GC is one problem I can see the shape of a solution for, since you can use something like a per-object DVVset to determine the minimum set of unresolved histories required to avoid losing data during conflicts, while not unnecessarily ballooning the size of the dataset.
However, the inner-object conflict-resolution problem seems a lot harder to solve given that there's no obvious join-semilattice for arbitrary fields/data. Can you discuss what conflict-resolution strategies you're working on for auto-resolution and/or what metadata you intend to provide to the end-user in the event that you're going to punt resolution to them to handle?
Given this is supposed to be for collaborative workloads, the conflict-resolution issue seems to be a cornerstone. Git handles this by inserting sibling sections into the documents and forcing the end-user to manually deal with fixing problems, which is often fraught with pain and peril, and doesn't seem like a strategy that would work for something that's a database (as opposed to something that's a workflow).
This is a question that we've gotten quite a bit. It's our view that there's no magic solution to conflicts. There are logical conflicts in the real world that must be arbitrated.
That said, it's a surprisingly basic thing, but just knowing what changed from party (a) and party (b)'s perspective (relative to their most recently agreed-upon state) is somewhat rare or ad-hoc in existing systems. In noms, you can directly compute exactly how state diverged and apply whatever resolution strategy is suitable.
We have plans for applying default conflict resolution for changes to data-types that - in many cases - will be correct, but in the end, there's no avoiding that correctness can only be defined within a given specific domain.
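To make that concrete, here's a minimal sketch (in Go, using hypothetical types rather than the actual Noms API) of what "knowing what changed from each party's perspective" buys you: compute each side's diff against the last agreed-upon state, auto-apply the non-overlapping changes, and hand only the genuine same-field collisions to a domain-specific resolver.

```go
package main

import "fmt"

// Row is a hypothetical record type keyed by field name.
type Row map[string]string

// diff returns the fields whose values differ between base and other.
func diff(base, other Row) map[string]string {
	changed := map[string]string{}
	for k, v := range other {
		if base[k] != v {
			changed[k] = v
		}
	}
	return changed
}

// merge applies non-overlapping changes automatically and defers genuine
// same-field conflicts to a domain-specific resolver.
func merge(base, a, b Row, resolve func(field, av, bv string) string) Row {
	out := Row{}
	for k, v := range base {
		out[k] = v
	}
	da, db := diff(base, a), diff(base, b)
	for k, v := range da {
		out[k] = v
	}
	for k, v := range db {
		if av, both := da[k]; both && av != v {
			out[k] = resolve(k, av, v) // a real logical conflict: arbitrate
		} else {
			out[k] = v
		}
	}
	return out
}

func main() {
	base := Row{"name": "Ada", "city": "London"}
	a := Row{"name": "Ada", "city": "Paris"}     // party A moved her
	b := Row{"name": "Ada L.", "city": "London"} // party B fixed the name
	merged := merge(base, a, b, func(f, av, bv string) string { return av })
	fmt.Println(merged) // map[city:Paris name:Ada L.] -- only same-field edits need arbitration
}
```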
I've played with this problem on and off over the last few years. ShareDB[1] is powered by JSON OT[2], in which each change describes the meaning behind what you're trying to do. (For example, 'increment counter' is different from 'change counter from 2 to 3'. They look the same, but behave differently in the case of conflicts). Just knowing what changed often isn't enough to do proper resolution.
I've spent years on and off playing with a better, faster, stronger version of the JSON OT code[1] which also supports arbitrary object reparenting. You run into problems where you really want conflicts as well. For example, given {x:{}, y:{}} user A moves x into y, and user B moves y into x. There's no good solution to resolving this without more information or conflict markers & humans.
Doing this in a P2P setting is hard & interesting. Very cool stuff though!
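To illustrate the increment-vs-set point above, here's a small self-contained Go sketch (illustrative names, not ShareDB's or Noms' API): both operations produce the same state diff against the base, but they compose very differently when replayed against a concurrent edit.

```go
package main

import "fmt"

// op is an intent-carrying operation, not just a resulting state.
type op struct {
	kind  string // "set" or "incr"
	value int
}

// apply replays one party's operation on top of the current state.
func apply(state int, o op) int {
	switch o.kind {
	case "set":
		return o.value
	case "incr":
		return state + o.value
	}
	return state
}

func main() {
	base := 2

	// Both parties saw 2 and want 3: one said "set to 3", the other said "+1".
	a := op{"set", 3}
	b := op{"incr", 1}

	// As state diffs they are indistinguishable: each is just 2 -> 3.
	fmt.Println(apply(base, a), apply(base, b)) // 3 3

	// Replayed on top of each other's result, the intents diverge:
	fmt.Println(apply(apply(base, a), b)) // 4: the increment composes
	fmt.Println(apply(apply(base, b), a)) // 3: the "set" clobbers the concurrent edit
}
```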
Having a "high level" description of what changed makes manual merges easier.
I wish git had conflict resolution like this. Treating data as an arbitrary sequence of bits is general and correct. Treating data as having a particular format can be useful, too.
It does? git merge --strategy xyz means git will invoke git-merge-xyz to do the actual conflict resolution. It comes with a variety of built-in ones (see https://git-scm.com/docs/merge-strategies) but you can write more if you have some special-purpose approach you want to use.
Git also has smudge/clean filters if you want to transform your file into a format where line-by-line textual merge is more meaningful.
And you can use .gitattributes to make certain extensions/folders/files/whatever default to a certain treatment.
A good example is https://bitbucket.org/sippey/zippey which unpacks zip archives (and hence file formats based on them, like .jar, .docx, etc) to allow the contents within to be tracked in the git repo better. Other custom formats could do something similar...
To me, there are two separate concerns: intent preservation and coherence.
Operations cause a change of state. The sequence of operations performed for a given user action needs to ensure that the state of the database is modified in the way the user intended.
Coherence is about maintaining a design. A set of rules is conceived; for each state, it is possible to determine whether the state is valid according to them, and if so, the database is coherent.
To maintain coherence, it is possible to deny an operation, therefore breaking user intent. To maintain user intent, it is possible to accept an operation that leads to an invalid state. From what I understand, noms is heavily tilted towards coherence, like git, but unlike git, it doesn't have the social "pull request" aspect, nor a support for running tests.
You mention diffing as a plus, but realistically the only use of diffing is to guess intent. Real intent can only be provided by a maximally rich set of operations. Logs of SQL queries, for instance, are more likely to provide insight than a cold diff. For a database that is tilted so far towards maintaining coherence at the expense of user intent, it may be a good idea to compensate by staying close to the operations.
Finally, if you decide to tilt towards intent preservation, there are definitely approaches to automate merges, avoiding nagging the user to fix conflicts. The most successful trivial solution remains latest-write-wins, which gives surprisingly good results assuming a high data granularity and a rich set of operations. But unless there is a way to automate coherence validation (the equivalent of which in git projects is, I suppose, running the tests), we can only rely on user attention… in which case, relying on the user to fix coherence after the fact on a database that heavily preserves user intent would be pretty much the same.
So… do you plan on supporting custom coherence rules? Alternatively, which conflict resolutions are you leaning towards?
Much of this is yet to be designed, but in general, our approach has been to lean towards a relatively layered system. That is to say, that at the lowest level, noms probably won't make a judgement about the trade-off you describe above, but rather provide primitives which allow layers above to more easily take opinionated positions.
In the near-term we'll be paying attention to specific cases that arise, and we'd welcome the opportunity to learn more about those that you may encounter.
Google's Spanner, for instance, relies on their TrueTime design, which requires having GPS clocks and atomic clocks in each datacenter, I believe. Most designs simply rely on NTP or a similar time synchronization system.
Another approach is to maintain a total order of writes. Assuming some form of consensus protocol to determine write order, the uniqueness of the order ensures synchronization. That design, however, tends to preserve user intent less. Bitcoin has a form of that.
Only $12k for a Spectracom. Granted they are nice.
Well, I guess if it demonstrates anything, it's one fact -- if even Google needed GPS hardware to provide "latest" in a distributed system, then OP is right. Time in distributed systems is very hard.
> Most designs simply rely on NTP or a similar time synchronization system.
Not sure if "simply" was meant sarcastically or not. If reliability and not deleting users' data are important, and the system relies on getting NTP time, I would strongly advise against using that distributed DB system.
They mention striving to support many contexts. In the demo, they showcase offline editing of a single CSV entry. If the granularity is the atomic types, then a conflict only occurs when the very same field in the very same row is concurrently edited.
Then, the system can show the conflict and offer a default that keeps the operation with the highest timestamp, or if the timestamps are identical, the one with the highest hash.
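A toy sketch of that default, assuming a hypothetical Edit record carrying a writer timestamp and a content hash (neither is an actual Noms structure): prefer the higher timestamp, and break exact ties deterministically on the hash.

```go
package main

import "fmt"

// Edit is a hypothetical record of a concurrent change to one field.
type Edit struct {
	Timestamp int64  // e.g. unix millis from the writer's clock
	Hash      string // content hash of the edit
	Value     string
}

// pickWinner keeps the edit with the highest timestamp, breaking exact ties
// with a deterministic comparison of the hashes.
func pickWinner(a, b Edit) Edit {
	if a.Timestamp != b.Timestamp {
		if a.Timestamp > b.Timestamp {
			return a
		}
		return b
	}
	if a.Hash > b.Hash {
		return a
	}
	return b
}

func main() {
	a := Edit{1700000000000, "9f2c...", "value from A"}
	b := Edit{1700000000000, "b41a...", "value from B"}
	fmt.Println(pickWinner(a, b).Value) // value from B: same timestamp, higher hash
}
```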
> [...] forcing the end-user to manually deal with fixing problems, which is often fraught with pain and peril, and doesn't seem like a strategy that would work for something that's a database (as opposed to something that's a workflow).
Isn't that what (for example) CouchDB does? I believe the reasoning is that conflict resolution is often application specific, so why not deal with it in the application?
Riak does this when not using CRDT data types, but the problem is that most of the time there's no clear way for the application or the user to deal with it either, because it requires an awful lot of context to make a good decision.
Using the same example of how Git deals with this: think about the times you've gone through a merge conflict on a chunk of conflicting code where you don't really have any knowledge or context for why the other stuff that's not yours is even there, and say you don't have a way to collaborate with the other developer or someone in leadership to make sense of those parallel efforts. You can only make sane decisions about complex conflict resolution when you have a lot... a lot... of surrounding context and intent. You need extra metadata, and the underlying system needs to expose that to you. That system might be your engineering process, your manager, or a co-worker, or, in this case, the database.
Hence my interest in how they're intending to expose this set of concerns to the user.
In case of conflicts, CouchDB assumes the most modified branch of the document (i.e., the document with the higher revision number) is the winner. You can resolve the conflict by choosing a different branch/revision manually, but you can also choose to not do anything.
Yes, it picks a winner, which it shows on all machines (so all machines that have seen the same changes will pick the same winner). But it also keeps the conflicts around, so users who care about them can correctly resolve them.
Sometimes the winner it picks is not what the users want. That can be surprising, but it is correct behavior because it really is a user-level conflict.
(Now, a user may very well add a timestamp field to the document, hope NTP works well, and resolve conflicts based on that if they appear, but CouchDB tries not to make such assumptions on behalf of the user.)
Really? That's nearly undifferentiated from just picking one at random. How is, "whoever hits the queue most often" a useful deterministic resolution strategy? I mean I guess it's functionally no worse than wall-clock time or something, but still kinda funny. :-)
It is not random. It is consistently picking the same document on all servers that have seen the same changes. By default it picks the one with the most changes. That consistency (the "same" part is very important) means that if you replicate and bring in some conflicts, both sides will show the same state. So after replicating A to B and B to A, you won't randomly see document 1 as the winner on A but document 2 on B. They'll both pick 1 or 2, so both settle on the same state.
Also, it doesn't delete or remove conflicting siblings; it is very good about not doing that to user data. Only the users know exactly how to resolve particular conflicts.
> How is, "whoever hits the queue most often" a useful deterministic resolution strategy?
It's a deterministic resolution strategy, and is thus useful.
> I guess it's functionally no worse than wall-clock time or something,
Wall-clock time is not deterministic; therefore it's far worse.
When dealing with distributed systems, deterministic processes are critical. Multiple systems all being right is awesome, but multiple systems being wrong in different ways is a nightmare. :)
Is it deterministic in a way that's useful? From the perspective of the end user, it's going to appear random, because they don't control the system environment in which "highest revision number" could mean something useful to them. In fact, the CouchDB guide seems to allude to this when it talks about not relying on this scheme for complex conflict resolution.
Two nodes, A and B, split. Say there are 100 updates to A and 500 updates to B. The split heals, and the system picks B because 500 > 100, but the write you actually want to dominate is on A. The user can't control which replica gets hit more often, or when a split happens, so while this might be deterministic inside the DB, it is semantically random from the user's perspective. So the system can make the same choice on all replicas, assuming it can guarantee it has seen all replicas, which I guess allows you to push merge management to each replica instead of requiring an intermediate coordination replica and then re-publishing the merged state to the replicas? So there's a system optimization benefit there.
But consider if the system did pick a winner at random, how would this look any different to the user? The user doesn't necessarily know if A or B should be picked.
Deterministic behavior is really important, but it seems like it really only looks non-random to the end user when deterministically picking a least upper bound for converging a join-semilattice, or when all operations on the data are commutative or idempotent, doesn't it?
Yes, because it makes it possible to maintain a consistent state across distributed nodes.
> From the perspective of the end-user its going to appear random
...but consistent. If every node picks a random revision on conflict, then when multiple clients try to continue editing, they'll end up increasing the conflicts.
> The split heals, the system picks B because 500 > 100, but the write you actually want to dominate is A. The user can't control which replica gets hit more often, or when a split happens, so while this might be deterministic inside the DB it is semantically random from the user's perspective.
Yeah, but what happens if B picks B, and A picks A? Now the write you're looking for is either there or not there, depending on which node you're talking to.
> how would this look any different to the user?
Everything is going to look random to the user, no?
It is only that random if the systems remain isolated for a long time. Since CouchDB requires you to send the last revision number when updating a document, if the systems are live-replicating between themselves, the guy who is hitting the queue more rapidly will be forced to fetch the latest winning revision every time before hitting the queue (and that may be a revision from a different guy). This gives him time to think about the revision he just received, perhaps examine the document linked to that revision number, see if everything is in place, perhaps merge changes himself manually... all that before updating the document in the database.
I don't see how a deterministic approach could do much better than this. The recommendation is always that the developer should implement a saner way to resolve the conflicts.
In the CouchDB world, however, I have the impression that conflict resolution is ignored most of the time, so we are left with this.
(I say this based on what I do, other people's code I read on the internet, and the concerns of the CouchDB core developers about educating users and developers to set up saner conflict resolution approaches themselves.)
I don't think so. Though "most recently changed" is pretty useless too. Clocks won't be synchronized in a distributed system, and even on a single machine that's set up to use something like NTP, time won't be monotonically increasing, since the clock-sync mechanism can move time both forward and backward.
"Each revision includes a list of previous revisions. The revision with the longest revision history list becomes the winning revision. If they are the same, the _rev values are compared in ASCII sort order, and the highest wins. So, in our example, 2-de0ea16f8621cbac506d23a0fbbde08a beats 2-7c971bb974251ae8541b8fe045964219."
But isn't `<database>/<dataset>` more or less similar to `<database>::<dataset>`? The only difference is the choice of a delimiter to disambiguate between a database and a dataset. For me, the first scheme is much more familiar.
You're just trading one arbitrary thing for another, IMO, but what's worse is you are now abusing the URL specification for the HTTP(S) protocol, so nobody can use existing HTTP URL libraries.
You could easily say everything before either ? or ; always refers to a database, and use a query parameter or a semicolon to delineate a dataset. Or you could use resource paths:
Glancing over RFC3986 [1], fragment identifiers seem to be pretty much made for what you're trying to communicate with :: - separating a subresource (the dataset) from a primary resource (the database). Unless I'm misunderstanding something?
There are issues with using `:` in a URL, if you plan on using the URL in a way that's compatible with the extant software out there. Two cases I remember:
- The Rails community tried to use `;`, which broke Mongrel 1. Mongrel's parser was generated from the RFC. There was a huge flame war about that back in the day; the Rails core team at the time thought that Mongrel should make an exception for a reserved character. (And after all was said and done, it got changed back to `/` for that particular use case.)
- When working on IPv6 support about 3 years ago, one of the things I added to an open source Ruby project was IPv6 literals in URLs. This was a case of using `:`. Even though this was defined in the RFC specifying the literal, I found out at the time that the Ruby standard library was written in a way that assumes you would never have `:` in a URL other than to delimit the port. I ended up having to do some workarounds for that.
That's just Ruby. I wouldn't be surprised if many other extant URL-parsing libraries break as well -- at least unless those characters are escaped.
You don't NEED ":". You NEED some sort of delimiter that can clearly distinguish between database and dataset; you happen to pick ':' to satisfy that. There might be a different delimiter that works better.
The other option is to not pretend it is a URL and call it something else.
Post-script: I think this project is a great idea. I'm looking forward to seeing how it turns out.
But as I mentioned in my previous reply, there may be unintended consequences. If this is something you guys want to do (and you want HTTP/HTTPS URL compatibility), check it out on different languages/platforms and see if your scheme breaks things. (And definitely see whether Windows libraries assume this; Windows file paths use `:` as a reserved character.)
Thanks for the help everyone with this most important aspect of the system ;).
To clarify, we don't think of these specs as URLs. The part before the final double colon is a URL. To parse one, you find the final double colon and take everything to the left of it as a URL.
> To clarify, we don't think of these specs as URLs.
But everyone else will because you are including the protocol, and at the end of the day, they are a uniform way of identifying a resource, so they are functionally URIs.
SQLAlchemy and most DB URIs are good examples of how to do this. For example, you can connect to a MySQL database instance and give it a default namespace/schema/database.
Part of the issue here is the ambiguity between a database, a database instance/server/host, a dataset/table, a catalog/namespace/schema, and what all those words and concepts mean. There's little consensus across fields, because even if computer scientists say "Okay, this is what a dataset actually is", somebody, whether it's a biologist or a physicist, will throw up their arms in protest.
Isn't the append-only design unsuitable for scenarios where many updates/deletes are made? If you update/delete 1GB of your 2GB database each day, then after a year the database is 365GB in size, but the live data is only 2GB.
I think the git-like features (history, merging) are very helpful for internal work, but when the dataset must be published, I think in most cases only the newest snapshot should be made available. But then the question is what format should it have...?
It just depends on the details. If you have a dataset in which 50% of the values change every day, and it doesn't compress well, then yeah, your Noms archive of that entire dataset is going to grow quickly.
In such situations, you could either (eventually, when it is implemented) prune old data, or aggregate the changes into bigger blocks.
Strawman marketing alert: "The most common way to share data today is to post CSV files on a website". Maybe there are a bunch of people that still do that somewhere, but if so, they ain't early adopters of decentralized database technology and so not your target customers. It's always better to talk about what your most likely customers are doing now.
And this isn't that surprising: it's human-readable, it gets the job done, and zipping gives decent compression.
I'm not sure who you think the target market for this would be, but I'm sure that if it's an efficient local format, you could probably get the ML crowd on board.
Right. A shocking amount of public data is distributed this way.
Also, we routinely talk to developers who complain about the difficulty of consuming data snapshots from partners, parsing it, trying to understand how it has changed since last time, etc.
With high value datasets, people frequently build an API to combat these problems. But it's hard to design a good API, and even if you succeed, it has to be secured, documented, scaled, and maintained indefinitely.
It'd be entirely up to the distributor of the data, so perhaps the answer to your question is "all of the above".
For example, (1) our command line tools use URL-like paths which implies "use this hostname" (to copy-paste into terminal), (2) we have some in-browser visualisations like http://splore.noms.io/?db=http://demo.noms.io/cli-tour which implies more of a "click here" type UI.
Hmm, if you want people to be able to link to Noms datasets on the web, maybe you should switch to using URLs to name the datasets, instead of a two-part identifier with a URL separated from a dataset name by a "::"? Darcs and Git seem to get by more or less with URLs and relative URLs; do you think that could work for Noms too?
The super REST harmonious way to do this would be to define a new media-type for Noms databases with a smallish document that links to the component parts. Like torrent files, but using URLs (maybe relative URLs) instead of SHA1 hashes for the components, maybe?
This is a good point. We never thought of these strings as URLs, but there are places where it would be nice to use them that only want URLs (the href attribute, for example).
The way we have it now is nice in that any valid URL can be used to locate a database. I am loathe to restrict that.
The hash portion of a URL is not transmitted to the server by browsers, so it wouldn't help in the case of putting the string into a URL bar or a hyperlink.
If the resource you're linking to is a database (or, to speak more strictly, if its only representation is a resource of a noms-database media type), rather than an HTML page or something, can't the browser be configured to pass it off to a Noms implementation, complete with the dataset identifier within? I mean, that's what people do with page numbers in PDF files, right?
Not only public data. At my main project I'm testing systems that generally crunch data from various sources, and yes, most of them are in CSV format; we then process them only slightly (some filtering, aggregation, translation) and spit other CSVs out. I was amazed that the company had not bothered to create a more... civilized (?) solution for internal data processing - but I guess that since it works, there's no drive to change it on a whim.
CSV for shoving files around (or TSV or whatever similar thing) is great because it generally just works. I can throw it into virtually any language or system, open and read it myself, grep it, check it with any platform. I can often get away with just looking at the files and absolutely nothing else, though a data dictionary is hugely appreciated.
I don't need to make sure I've got postgres 9.5 setup with a particular user account & set configs for the password, start ES (but not version Y because of a feature change) on port Z, etc. I don't need to manage making sure the two branches I'm looking at don't overlap or try to write to the same database. Keeping multiple results and comparing their output can be easily done as they're just files to be moved. Small tasks that read a file and spit out another can be checkpointed just by making them look to see if the file they expect to create already exists.
I'm hugely in favour of CSV for external data too. Sure, provide other options as well, but I love that the "get all the data" command can be as simple as a curl command. I don't want to read your API docs and build something custom that tries to grab everything, I don't want to iterate over 2M pages, I don't want to deal with timeouts, rate limits, etc. Just give me a URL with a compressed CSV file.
All the problems that come along with it, for me, are related to poor data management which I doubt a format change would fix.
Maybe CSV isn't the best internally, but for a vast amount of cases it's nearly the best and gives you a lot of flexibility. My general advice would be to start with CSV unless you've got a good reason not to, and then try and move to a different line based thing (jsonl, messagepack?). It is highly unlikely to be the biggest problem you have with your data, and the time spent putting it into a more "sane" format is often (in my experience) better spent on QA and analysis of the data itself.
I'd say the current problem is that lots of data is available only either in excel files, pdfs, and APIs pointing to a possibly constantly changing data store.
Is it possible that their target market is not current users of decentralized DBs?
At first glance, it strikes me as a solution for people storing something like scientific data sets rather than application data. In which case, posting CSV files on a website is pretty much best-case scenario.
EDIT: Although, in the "scientific data set" scenario, I'm not sure how much value there would be in storing version history.
Right, this looks like it has far more immediate value for storing data which will be collaboratively mutated, so: a company directory, a knowledgebase, a CRM datastore...
For large scale datasets, I'd be looking at GIS data maybe.
After more reading, it kind of sounds like a better CouchDB?
And cool if so, but the CSV analogy is really confusing.
When I think of CSV, I think of a ghetto data exchange format that I can send to or receive from a less technical person. As I understand Noms, it does not sound like it's for non-technical people.
At first glance, this reminds me a little bit of datomic - all data history is preserved/deduplicated, fork/decentralization features. Can you comment on how it compares?
I feel weird speaking for them, but at a product level, I think it's fair to characterize Datomic as an application database -- competing with things like mongo, mysql, rethink, etc.
While Noms might be a good fit for certain kinds of application databases (cases where history, or sync, is really important), we're really focused more on archival, version control, and moving data between systems than on being an online transactional database.
Also, at a technical level, unless I'm wildly mistaken, I don't believe that Datomic is content-addressed, and I wouldn't call it "decentralized" (though that word is a bit squishy).
Dat is (currently) focused on synchronizing files in a peer-to-peer network.
Noms can store files, but it is much more focused on structured data. You put individual values (numbers, strings, rows, structs, etc) into noms, using a type system that noms defines, and this allows you to query, diff, and efficiently update that data.
Also Noms isn't peer-to-peer (although we hypothesize that it could run reasonably on top of an existing network like IPFS).
Dat has changed so much, and there has been so much hype and tooling (probably now broken) around it, and yet it doesn't seem to be delivering anything, nor does there seem to be much data anyone is willing to publish with it.
We were also heavily influenced by camlistore (which I hacked on for a while), irmin, ipfs, and others who have done a lot of interesting work in this space.
Noms should be useful as a backup utility, but I'd say it's especially useful for backing up data which is not files. Think about backing up data which you only have access to via API.
You can take the JSON output of an API and drop it into Noms, then do the same thing tomorrow, and Noms will automatically deduplicate the data as well as give you a nice structured API to read and interact with it.
Another question to help me better understand the tool: who do you see as your competitors, technically? What do you see as viable alternatives to Noms, but worse? (Or better!) I do see that you were inspired by git, but clearly your use-cases are different.
Git is a competitor. It is fairly common to check data (e.g., csv or json files) into Git today.
However, this falls down pretty rapidly. In order to get reasonable diffs, the data has to be sorted, and line-oriented. Also Git just doesn't scale well to larger repos or individual objects.
Otherwise, we see the competitors as the way that people distribute data today - custom APIs, zip files full of CSV, etc.
How would you say Noms compares to Datomic[1]? Both projects are working on the same idea of representing a database as tree of commits over time.
From my quick inspection, it looks like Noms shows some focus towards working in multiple branches, whereas Datomic, at least in its marketing materials, just talks about preserving a single timeline.
Why would you exclusively support schema inference, rather than also allowing users to manually specify their schemas?
Schema inference is very difficult to do correctly and safely, especially with small initial samples of instances (source: work on https://github.com/snowplow/schema-guru).
There's a little bit of terminology overloading going on here.
In Noms every value has a type. It's an immutable system, so this type just is. The type of `42` is `Number`. The type of `"foobar"` is `String`. The type of `[42,44]` is `List<Number>`. And if you add "foo" to that list, the type becomes `List<Number|String>`.
We don't try to infer a general database schema from a few instances of data. We just apply this aggregation up the tree and report the result.
That all said, we do want to eventually add schema _validation_, by which I mean the ability to associate a type with a dataset and have the database enforce that any value committed to the dataset is compatible with that type (following subtyping rules).
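A rough sketch of the aggregation described above, using ordinary Go values and hypothetical helper functions rather than the real Noms type system, just to show how a list's type widens into a union as new element types appear:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// typeOf returns a primitive type name for a leaf value.
func typeOf(v interface{}) string {
	switch v.(type) {
	case float64, int:
		return "Number"
	case string:
		return "String"
	}
	return "Unknown"
}

// listType aggregates the element types it has seen into List<T> or List<T1|T2|...>.
func listType(elems []interface{}) string {
	set := map[string]bool{}
	for _, e := range elems {
		set[typeOf(e)] = true
	}
	names := make([]string, 0, len(set))
	for n := range set {
		names = append(names, n)
	}
	sort.Strings(names)
	return "List<" + strings.Join(names, "|") + ">"
}

func main() {
	l := []interface{}{42, 44}
	fmt.Println(listType(l)) // List<Number>

	l = append(l, "foo")
	fmt.Println(listType(l)) // List<Number|String>
}
```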
It sounds like Noms is dynamically typed, rather like SQLite — types are associated with values, not (just) with datasets. The difference is that SQLite (like Python or JS) only types leaf/atomic data, while you're also typing aggregate data. Is that right?
Right - the challenge is that with dynamic typing and without schema validation, it's incredibly easy to break any strongly typed client/consuming application. You think you are dealing with a `List<Number>`, you have Go/Java/Haskell/whatever apps which are consuming that in a strongly typed fashion using their idiomatic record types, and then suddenly a user accidentally sends in a single value which turns the tree of values into a `List<Number|String>`, and all your consuming apps break.
Given that schema validation ("does this instance match this type?") is simpler to implement than schema inference ("what is the type of this instance?"), it's surprising to me to deliver inference first...
I don't understand how we could have gone in the opposite direction.
Schema validation for us is just looking at the type requirements of the dataset and the type of the value and seeing if they are compatible. How can we do that without first knowing the type of the value?
IPFS is essentially providing a globally decentralized filesystem. Noms is providing (or hopes to provide) a database.
By database, I mean:
- small individual records
- efficient queries, updates, and range scans
- ability to support complex queries
- ability to enforce structural data validity
These are all things that IPFS could eventually grow to support, but in order to do it, I think it would have to grow into or layer something like noms on top.
How would Noms work for massive amounts of geo-temporal data, with many inserts and queries but few (or zero) updates? Efficient queries on geo-temporal keys are useful.
I mean, by their definition it is a database, but I can understand your usage. Then again, they both say it is a database, and in your link, they say it "isn't quite there yet", so /shrug heh.
The HN title suggested it's a database, which made me really curious as I can finally stop using history tables (or wal logging, or the other myriad ways of seeing a point in time). However, that doesn't seem to be the case here?
That said, the idea of "git as a datastore" does seem akin to "blockchain as data verification". Combine those two ideas together, get PWC involved and you have multimillion dollar deals coming in for audit protection.
I've been working on something pretty akin to what you describe, hosted verifiable data structures (logs and maps). Rather than a blockchain, it uses the same data structures as Certificate Transparency to provide equivalent functionality. Would love to get some feedback if you have the time to look: https://www.continusec.com/
I've been wanting something like Noms for a while. Prolly trees sound really promising.
In intro.md, you suggest, "If you wanted to find all the people of a particular age AND having a particular hair color, you could construct a second map having type Map<String, Set<Person>>, and intersect the two sets." In that case, how should I keep the two maps in sync? Do I need to atomically update the logic of all the instances of the application to modify both maps instead of just one? Or do I keep the second map (the hair color index) in a separate index database and update the index whenever I pull changes from a remote database? (What does the API look like for getting notified of new changes that haven't been indexed yet?)
I see that "noms sync" does both push and pull. Does that mean I can't pull data from a database I can't write to? How does that work over HTTP — do I need to use a special HTTP server that knows how to accept and authenticate write requests, or can I just dump a Noms dataset in a directory and serve it up with Apache?
Forgive me if these questions are obvious — I've read the docs I could find, but I haven't read any of the code beyond the hr sample.
> Do I need to atomically update the logic of all the instances of the application to modify both maps instead of just one? Or do I keep the second map (the hair color index) in a separate index database and update the index whenever I pull changes from a remote database? (What does the API look like for getting notified of new changes that haven't been indexed yet?)
Currently, you have to manually keep an index up to date. But keep in mind that internally this is what all databases are doing -- manually reflecting changes into indexes -- they just hide it from you.
Eventually, we imagine that there will be tools to declare indexes you want to maintain and we'd do it for you. Note that because Noms is good at diffing, calculating the changes that need to be re-indexed comes for free!
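For what it's worth, here's a minimal sketch of what that manual index maintenance looks like, using plain Go maps in place of Noms values (the Person shape and helper names are made up for illustration): every write updates both maps, and the "age AND hair color" query from intro.md is just a set intersection.

```go
package main

import "fmt"

// Person and the two index maps below are made-up stand-ins for the
// Map<Number, Set<Person>> / Map<String, Set<Person>> shapes in intro.md.
type Person struct {
	Name string
	Hair string
	Age  int
}

type set map[string]Person // keyed by name, just to keep the sketch small

var byAge = map[int]set{}
var byHair = map[string]set{}

// addPerson is the "manual index maintenance": every write touches both maps.
func addPerson(p Person) {
	if byAge[p.Age] == nil {
		byAge[p.Age] = set{}
	}
	if byHair[p.Hair] == nil {
		byHair[p.Hair] = set{}
	}
	byAge[p.Age][p.Name] = p
	byHair[p.Hair][p.Name] = p
}

// intersect answers the "age AND hair color" query from the two indexes.
func intersect(a, b set) []Person {
	var out []Person
	for name, p := range a {
		if _, ok := b[name]; ok {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	addPerson(Person{"Ada", "brown", 36})
	addPerson(Person{"Bob", "brown", 41})
	addPerson(Person{"Cat", "red", 36})
	fmt.Println(intersect(byAge[36], byHair["brown"])) // [{Ada brown 36}]
}
```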
I'm surprised you didn't use a functional language like Haskell or OCaml or Rust to do this, since the article talks about love for functional programming.
I'm not criticizing Go at all, it's just not really a functional language.
Git's author felt the alternative (gratis) systems were lacking. Noms's author, on the contrary, praises Git but doesn't build on it. He chooses to implement the same technology himself, and from the docs it's not clear to me why that is.
Off the top of my head: Git can only use SHA-1, which makes it unsuitable for any use case where you need to cryptographically verify the origin of data (so far nobody has been able to tell me definitively how secure Git's signed commits and tags really are).
Assuming SHA1 has second pre-image resistance (which it currently still does), the security of git signed commits/tags is the same thing as the security of the private key used to sign the commits/tags.
This is really interesting! What are some ideal use cases for the current implementation? I've seen that Git is considered a competitor, but Noms also appears to be a generic database, so I would just like to hear some basic use cases, if possible.
Eg: If used as a database, what applications would benefit from Noms? Could/should this be used for personal storage? Could/should this be used for code versioning (ie, Git)?
Firstly, it would be cool if this could be a single gateway to "all the data in the world". Right now it's a pain to find, say, energy generation statistics for, say, Portugal, but it would be great if I could do something like:
noms get statistics.industry.energy.portugal.all();
Secondly, the versioning idea could have some really cool applications. For example, I work in data analytics, and sometimes I want to transform some data in an SQL table.
Doing transformations nicely is a bit difficult. Either I'm doing the calculations in a column of a view, with the associated performance hit, or I'm tacking columns onto the table, which quickly leads to a mess, especially during the initial stages of analyses.
It would be so cool if I could treat the database as a constantly-evolving git tree.
I really like the idea in theory, but seeing it in practice I feel the whole thing is too concerned with being a wrapper around git handling for their dataset files.
I would much rather see diffs based around the records themselves, and not so much the structure of the data.
While Git is referenced as an inspiration, the implementation of Noms does not use Git. Noms performs diffs on the data - as records or whatever other structure you used in importing your data. CSV is but one example of a way to import data into Noms, but since so much data is available in that format it is an easy one to reference that most people know. Noms can also import JSON, XML and many other data types if you are willing to write JS or Go code (more to come). Thanks for taking a look at Noms!
You can, and people do that today. It has limitations though:
* The data must be sorted in order for Git to provide good diffs
* It does not scale very well. On my machine, Git refuses to diff files over 1GB (maybe there is a setting for that)
* You must clone the entire repository onto your machine to work with it
* There is no programmatic API -- you must work with the data and changes as text and line diffs
We've been struggling to manage a collection of periodically updated CSVs & binaries a few GBs in size. We struggled with Git-LFS and gave up, and we were considering (dreading) SVN, so this looks really promising. Cheers!
Can you elaborate a bit on how the hashing and chunking works?
There's a rolling hash for determining chunk boundaries, and also SHA-512/256 somewhere.
Does the same data chunked differently have a different hash?
Briefly, there are two main hash functions in use in noms.
sha-2 is used to compute the hash of individual chunks. This is the classic use of hashing in content-addressed systems.
We also use a rolling hash to compute chunk boundaries. We do this in the typical way that tools like bup, camlistore, rsync, and others do for large files.
But our observation was that if you squint your eyes, a merkle tree looks a little like a b-tree. So we use a rolling hash to break up huge lists, maps, and sets into trees where nodes are roughly 4KB. So it's a kind of self-balancing, probabilistic, deterministic b-tree thing.
We never chunk the same data differently. An inviolable rule of Noms is that the same logical value is always chunked the same way and always has the same hash.
If I start with integers 1-1000000 and you start with integers 0-999999, and we both make mutations to converge at the same list, we will end up with the exact same tree, with the exact same hashes.
This is what makes efficient synchronization and diff of noms data possible.
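For readers unfamiliar with the technique, here's a minimal sketch of bup/rsync-style content-defined chunking in Go (a buzhash-like rolling hash; this is an illustration of the idea, not Noms' actual rolling-hash code): boundaries fall wherever the hash of the trailing window matches a pattern chosen to give roughly 4KB average chunks, so identical byte streams always split at identical offsets.

```go
package main

import (
	"fmt"
	"math/bits"
)

const (
	windowSize = 64                // bytes of trailing context that decide a boundary
	boundary   = uint64(1)<<12 - 1 // ~1 window in 4096 matches -> ~4KB average chunks
)

// mix64 is a splitmix64-style finalizer used to fill the byte table with
// well-scattered values.
func mix64(x uint64) uint64 {
	x += 0x9e3779b97f4a7c15
	x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9
	x = (x ^ (x >> 27)) * 0x94d049bb133111eb
	return x ^ (x >> 31)
}

var table [256]uint64

func init() {
	for i := range table {
		table[i] = mix64(uint64(i))
	}
}

// chunk splits data wherever the rolling hash of the trailing windowSize bytes
// hits the boundary pattern. The cut points depend on local content, so the
// same byte stream always produces the same chunks (and hence the same hashes).
func chunk(data []byte) [][]byte {
	var chunks [][]byte
	var h uint64
	start := 0
	for i, b := range data {
		h = bits.RotateLeft64(h, 1) ^ table[b]
		if i >= windowSize {
			// Drop the byte that just slid out of the window (rotation is mod 64).
			h ^= bits.RotateLeft64(table[data[i-windowSize]], windowSize%64)
		}
		if i-start+1 >= windowSize && h&boundary == boundary {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

func main() {
	// Deterministic pseudo-random input; the printed chunk sizes cluster around 4KB.
	data := make([]byte, 64*1024)
	for i := range data {
		data[i] = byte(mix64(uint64(i)))
	}
	for _, c := range chunk(data) {
		fmt.Println(len(c))
	}
}
```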
Thanks. So it's building a hash tree with deterministic chunking and that means you can cheaply update the hash after updating parts of the tree as you only have to rehash certain bits?
Does that mean that your chunk sizes are kind of fixed? Do you think there's a way to retain that advantage and be able to coalesce smaller chunks into larger ones?
Say your smallest nodes are 4KB but for more efficient storage you might want to go up to 4MB chunks.
Could that be done while retaining the same hash for the same underlying data?
Keep in mind (if this wasn't clear) that the chunks are only probabilistically 4K: https://github.com/attic-labs/noms/blob/master/go/types/roll.... I.e. the thing that's "fixed" here is the chunk size we're aiming for. The chunks themselves could be of any size.
In any case, that's a good question - we might want to do something about that down the line. But, if we did change that constant, the structure of the trees will change, and all[1] the hashes will change.
This looks really interesting. I've been thinking about the problem of distributed issue tracking lately... and the set of sub-problems it has (authorization and authentication, synchronization, and so on)... I'm not sure all of these problems could be covered by this, but I guess at least the "distributed" part could be covered by something like it.
I had an idea for this with a buddy in college after doing case study research into Git. I've always considered this the next step into a decentralized world outside of code and non-typed "text". I know CSVs were mentioned a few times; are you looking to narrow in on a few specific file types for a proof of concept?
We envision there to be tools that work on certain data types (Noms has a full type system), for example an app that displays all geo locations in a dataset.
This is really interesting, thanks for sharing it!
I haven't had a chance to dig into the code yet, but I notice that you say two replicas of the same database can be disconnected, altered, and then merged. Could you explain how Noms takes care of that, particularly in the case of collisions?
We assume that within a given version of the database format, there will never be a collision. The chances of a SHA-2 collision are beyond astronomical, and if you can create one, there are better things to do with your time than bother us.
That said, hashes only get weaker over time. The chances of an md5 collision used to be astronomical, now they are not.
So it was important to us to have an escape hatch - a way to increase the strength of the hash we use over time.
That's why we built a format version into Noms from the beginning. Our design is predicated on the fact that within a given version of the format, there is a 1:1 correspondence between hashes and values. Every value has exactly one hash, and every hash encodes exactly one value.
In future versions of the format, we might change the hash function. In this situation, we'd need to import data from the old format to the new format, just like how you have to sometimes migrate traditional databases across versions.
First off... I'm excited to see this project. There's a lot of potential here and this looks like a good implementation of a nice concept. I have at least a bit of authority behind that statement, since a few years ago, I had the opportunity to build something similar (although smaller in ambition.) A couple things to think about:
* Type accretion - This doesn't change the fact that database clients need to be able to accept historical data formats if they need to access historical data. The schema can't be changed for the older data objects without changing the hashes for that data, so there's no way to do anything like the schema migrations you'd do in SQL. For simple schema changes like adding fields, this might not be so hard to deal with, but some changes will be structural in nature and change the relative paths between objects. (This adds complexity to the code of database clients, as well as testing effort.)
* Security - Is there a way to secure objects stored within noms? Let's say I store $SECRET into noms and get back a hash. Does it then become the case that every user with access to the database and the hash can now retrieve the $SECRET? What if permissions need to be granted or revoked to a particular object after it's been stored? A field within a particular object? What if an object shouldn't have been stored in the database at all and needs to be obliterated? (This last problem gets worse if the object to be obliterated contains the only path to data that needs to be retained.)
* Performance - The CAS model effectively takes the stored data, runs it through a blender, and returns you a grey goo of hashes...this is good for replication, but it means you can't get much meaningful information out of a hash. This tends to mean a lot of operations like you might find in an old-school navigational database, and a huge dependency on the time to fetch an object given a hash. Indices can help by reducing the complexity of the traversals you need to do, but only if they're current and you have the index you need.
* Data roll-off - How do you expire data so that it doesn't just monotonically increase in volume? Say there's an API to mark an object as purgeable; the problem of identifying other purgeable objects then turns into effectively a garbage collection process. (git gc, etc.) There's also the issue of the sheer number of objects that can be involved. The system I was involved with had something like 500K objects/day that had to be purged after 120 days in the system. (A total of 60MM objects online and around 6TB or so.) Identifying 500K objects to purge and then specifying those to the data layer for action is not necessarily an easy thing....
* Querying - Server side query logic (and an expression language) is basically essential to performance. Otherwise, you wind up with a network round trip for every edge of the graph you follow. Going back to my first point, whatever querying language is used has to be flexible enough to handle a schema that might be varying over time (through schema accretion).
All four of these bullet points are worthy of a great deal more discussion, and I haven't even broached issues around conflict resolution, differencing, UI concerns, etc. I think there are good approaches to managing lots of these issues, but there's a bunch of engineering involved, as well as some close attention to scope and goals...
- Type accretion: I don't think in general that schema changes like what happens in sql databases works very well (I say this having worked on such systems). In big systems, it's hard to get everyone to agree on a moment to CHANGE THE SCHEMA. You can certainly do something like that in Noms -- just write a new dataset and replace the old one. But being able to read old data and leave old clients working I think is powerful. Couple this with the structural typing that falls naturally out of Noms and - I think - you have a more flexible way to change schemas over time.
- Perf: I'm not really following you here. CAS has some positives and some negatives for performance.
- expiration: 1. There are a huge number of systems today that never delete data. Taking advantage of that to make other operations faster makes sense. 2. Yeah, it's a GC problem; luckily GC is a well-studied problem. Also, since Noms is a merkle tree and merkle trees are good at diff, we have some additional leverage. We don't need to do a full scan every time.
- querying: disagree that it is essential to perf. Another option is to have a schema that matches your access model. You can do that server-side in addition (or instead) of having a query language.
===
It sounds like you have thought a lot about all of this! If you are interested, your brain would be very appreciated in the github or slack.
> It sounds like you have thought a lot about all of this!
Up until around 2014, I was heavily involved in the construction of a small CAS (100MM objects online, around 5-6TB in size) for a client that needed to replicate certain periodic calculations in a reliable way. It worked well, but something like noms would have eliminated the need for a bunch of custom work.
> If you are interested, your brain would be very appreciated in the github or slack.