So, I realize this project is early, but it would be EXTREMELY helpful to walk through someone's use case - like, who is the target here? A business analyst who iterates on cleaning / analyzing small Excel CSVs? Or someone else?
After watching the screencast, all I saw was a bunch of commands explained (I could have read the docs for that). Instead, I'd like to walk through a use case where this solves someone's problem.
He mentioned at least one. His friend went to a cabin where there was little to no internet connection, then updated the database on a local device. Later on, other database nodes would just pull in the updated data.
Seems like it is for personal use. Maybe something to build apps on top of.
GC is one problem I can see the shape of a solution for, since you can use something like a per-object DVVset to determine the minimum set of unresolved histories required to avoid losing data during conflicts, while not unnecessarily ballooning the size of the dataset.
However, the inner-object conflict-resolution problem seems a lot harder to solve given that there's no obvious join-semilattice for arbitrary fields/data. Can you discuss what conflict-resolution strategies you're working on for auto-resolution and/or what metadata you intend to provide to the end-user in the event that you're going to punt resolution to them to handle?
Given this is supposed to be for collaborative workloads, the conflict-resolution issue seems to be a cornerstone. Git handles this by inserting sibling sections into the documents and forcing the end-user to manually deal with fixing problems, which is often fraught with pain and peril, and doesn't seem like a strategy that would work for something that's a database (as opposed to something that's a workflow).
This is a question that we've gotten quite a bit. It's our view that there's no magic solution to conflicts. There are logical conflicts in the real world that must be arbitrated.
That said, it's a surprisingly basic thing, but just knowing what changed from party (a) and party (b)'s perspective (relative to their most recently agreed-upon state) is somewhat rare or ad-hoc in existing systems. In noms, you can directly compute exactly how state diverged and apply whatever resolution strategy is suitable.
We have plans for applying default conflict resolution for changes to data-types that - in many cases - will be correct, but in the end, there's no avoiding that correctness can only be defined within a given specific domain.
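To make that concrete, here's a minimal sketch (in Go, using hypothetical types rather than the actual Noms API) of what "knowing what changed from each party's perspective" buys you: compute each side's diff against the last agreed-upon state, auto-apply the non-overlapping changes, and hand only the genuine same-field collisions to a domain-specific resolver.

```go
package main

import "fmt"

// Row is a hypothetical record type keyed by field name.
type Row map[string]string

// diff returns the fields whose values differ between base and other.
func diff(base, other Row) map[string]string {
	changed := map[string]string{}
	for k, v := range other {
		if base[k] != v {
			changed[k] = v
		}
	}
	return changed
}

// merge applies non-overlapping changes automatically and defers genuine
// same-field conflicts to a domain-specific resolver.
func merge(base, a, b Row, resolve func(field, av, bv string) string) Row {
	out := Row{}
	for k, v := range base {
		out[k] = v
	}
	da, db := diff(base, a), diff(base, b)
	for k, v := range da {
		out[k] = v
	}
	for k, v := range db {
		if av, both := da[k]; both && av != v {
			out[k] = resolve(k, av, v) // a real logical conflict: arbitrate
		} else {
			out[k] = v
		}
	}
	return out
}

func main() {
	base := Row{"name": "Ada", "city": "London"}
	a := Row{"name": "Ada", "city": "Paris"}     // party A moved her
	b := Row{"name": "Ada L.", "city": "London"} // party B fixed the name
	merged := merge(base, a, b, func(f, av, bv string) string { return av })
	fmt.Println(merged) // map[city:Paris name:Ada L.] -- only same-field edits need arbitration
}
```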
I've played with this problem on and off over the last few years. ShareDB[1] is powered by JSON OT[2], in which each change describes the meaning behind what you're trying to do. (For example, 'increment counter' is different from 'change counter from 2 to 3'. They look the same, but behave differently in the case of conflicts). Just knowing what changed often isn't enough to do proper resolution.
I've spent years on and off playing with a better, faster, stronger version of the JSON OT code[1] which also supports arbitrary object reparenting. You run into problems where you really want conflicts as well. For example, given {x:{}, y:{}} user A moves x into y, and user B moves y into x. There's no good solution to resolving this without more information or conflict markers & humans.
Doing this in a P2P setting is hard & interesting. Very cool stuff though!
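To illustrate the increment-vs-set point above, here's a small self-contained Go sketch (illustrative names, not ShareDB's or Noms' API): both operations produce the same state diff against the base, but they compose very differently when replayed against a concurrent edit.

```go
package main

import "fmt"

// op is an intent-carrying operation, not just a resulting state.
type op struct {
	kind  string // "set" or "incr"
	value int
}

// apply replays one party's operation on top of the current state.
func apply(state int, o op) int {
	switch o.kind {
	case "set":
		return o.value
	case "incr":
		return state + o.value
	}
	return state
}

func main() {
	base := 2

	// Both parties saw 2 and want 3: one said "set to 3", the other said "+1".
	a := op{"set", 3}
	b := op{"incr", 1}

	// As state diffs they are indistinguishable: each is just 2 -> 3.
	fmt.Println(apply(base, a), apply(base, b)) // 3 3

	// Replayed on top of each other's result, the intents diverge:
	fmt.Println(apply(apply(base, a), b)) // 4: the increment composes
	fmt.Println(apply(apply(base, b), a)) // 3: the "set" clobbers the concurrent edit
}
```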
Having a "high level" description of what changed makes manual merges easier.
I wish git had conflict resolution like this. Treating data as an arbitrary sequence of bits is general and correct. Treating data as having a particular format can be useful, too.
It does? git merge --strategy xyz means git will invoke git-merge-xyz to do the actual conflict resolution. It comes with a variety of built-in ones (see https://git-scm.com/docs/merge-strategies) but you can write more if you have some special-purpose approach you want to use.
Git also has smudge/clean filters if you want to transform your file into a format where line-by-line textual merge is more meaningful.
And you can use .gitattributes to make certain extensions/folders/files/whatever default to a certain treatment.
A good example is https://bitbucket.org/sippey/zippey which unpacks zip archives (and hence file formats based on them, like .jar, .docx, etc) to allow the contents within to be tracked in the git repo better. Other custom formats could do something similar...
To me, there are two separate concerns: intent preservation and coherence.
Operations cause a change of state. The sequence of operations performed for a given user action needs to ensure that the state of the database is modified in the way the user intended.
Coherence is about maintaining a design. A set of rules is conceived; for each state, it is possible to determine whether the state is valid according to them, and if so, the database is coherent.
To maintain coherence, it is possible to deny an operation, therefore breaking user intent. To maintain user intent, it is possible to accept an operation that leads to an invalid state. From what I understand, noms is heavily tilted towards coherence, like git, but unlike git, it doesn't have the social "pull request" aspect, nor a support for running tests.
You mention diffing as a plus, but realistically the only use of diffing is to guess intent. Real intent can only be provided by a maximally rich set of operations. Logs of SQL queries, for instance, are more likely to provide insight than a cold diff. For a database that is tilted so far towards maintaining coherence at the expense of user intent, it may be a good idea to compensate by staying close to the operations.
Finally, if you decide to tilt towards intent preservation, there are definitely approaches to automate merges, avoiding nagging the user to fix conflicts. The most successful trivial solution remains latest-write-wins, which gives surprisingly good results assuming a high data granularity and a rich set of operations. But unless there is a way to automate coherence validation (the equivalent of which in git projects is, I suppose, running the tests), we can only rely on user attention… in which case, relying on the user to fix coherence after the fact on a database that heavily preserves user intent would be pretty much the same.
So… do you plan on supporting custom coherence rules? Alternatively, which conflict resolutions are you leaning towards?
Much of this is yet to be designed, but in general, our approach has been to lean towards a relatively layered system. That is to say, that at the lowest level, noms probably won't make a judgement about the trade-off you describe above, but rather provide primitives which allow layers above to more easily take opinionated positions.
In the near-term we'll be paying attention to specific cases that arise, and we'd welcome the opportunity to learn more about those that you may encounter.
Google's Spanner, for instance, relies on their TrueTime design, which requires having GPS clocks and atomic clocks in each datacenter, I believe. Most designs simply rely on NTP or a similar time synchronization system.
Another approach is to maintain a total order of writes. Assuming some form of consensus protocol to determine write order, the uniqueness of the order ensures synchronization. That design, however, tends to preserve user intent less. Bitcoin has a form of that.
Only $12k for a Spectracom. Granted they are nice.
Well, I guess if it demonstrates anything, it's one fact -- if even Google needed GPS hardware to provide "latest" in a distributed system, then OP is right. Time in distributed systems is very hard.
> Most designs simply rely on NTP or a similar time synchronization system.
Not sure if "simply" was meant sarcastically or not. If reliability and not deleting users' data are important, and the system relies on getting NTP time, I would strongly advise against using that distributed DB system.
They mention striving to support many contexts. In the demo, they showcase offline editing of a single CSV entry. If the granularity is the atomic types, then a conflict only occurs when the very same field in the very same row is concurrently edited.
Then, the system can show the conflict and offer a default that keeps the operation with the highest timestamp, or if the timestamps are identical, the one with the highest hash.
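A toy sketch of that default, assuming a hypothetical Edit record carrying a writer timestamp and a content hash (neither is an actual Noms structure): prefer the higher timestamp, and break exact ties deterministically on the hash.

```go
package main

import "fmt"

// Edit is a hypothetical record of a concurrent change to one field.
type Edit struct {
	Timestamp int64  // e.g. unix millis from the writer's clock
	Hash      string // content hash of the edit
	Value     string
}

// pickWinner keeps the edit with the highest timestamp, breaking exact ties
// with a deterministic comparison of the hashes.
func pickWinner(a, b Edit) Edit {
	if a.Timestamp != b.Timestamp {
		if a.Timestamp > b.Timestamp {
			return a
		}
		return b
	}
	if a.Hash > b.Hash {
		return a
	}
	return b
}

func main() {
	a := Edit{1700000000000, "9f2c...", "value from A"}
	b := Edit{1700000000000, "b41a...", "value from B"}
	fmt.Println(pickWinner(a, b).Value) // value from B: same timestamp, higher hash
}
```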
> [...] forcing the end-user to manually deal with fixing problems, which is often fraught with pain and peril, and doesn't seem like a strategy that would work for something that's a database (as opposed to something that's a workflow).
Isn't that what (for example) CouchDB does? I believe the reasoning is that conflict resolution is often application specific, so why not deal with it in the application?
Riak does this when not using CRDT data types, but the problem is that most of the time there's no clear way for the application or the user to deal with it either, because it requires an awful lot of context to make a good decision.
Using the same example of how Git deals with this: think about the times you've gone through a merge conflict on a chunk of conflicting code where you don't really have any knowledge or context for why the other stuff that's not yours is even there, and say you don't have a way to collaborate with the other developer or someone in leadership to make sense of those parallel efforts. You can only make sane decisions about complex conflict resolution when you have a lot... a lot... of surrounding context and intent. You need extra metadata, and the underlying system needs to expose that to you. That system might be your engineering process, your manager, or a co-worker, or, in this case, the database.
Hence my interest in how they're intending to expose this set of concerns to the user.
In case of conflicts, CouchDB assumes the most modified branch of the document (i.e., the document with the higher revision number) is the winner. You can resolve the conflict by choosing a different branch/revision manually, but you can also choose to not do anything.
Yes, it picks a winner, which it shows on all machines (so all machines that have seen the same changes will pick the same winner). But it also keeps the conflicts around, so users who care about them can correctly resolve them.
Sometimes the winner it picks is not what the users want. That can be surprising, but it is correct behavior because it really is a user-level conflict.
(Now, a user may very well add a timestamp field to the document, hope NTP works well, and resolve conflicts based on that if they appear, but CouchDB tries not to make such assumptions on behalf of the user.)
Really? That's nearly undifferentiated from just picking one at random. How is, "whoever hits the queue most often" a useful deterministic resolution strategy? I mean I guess it's functionally no worse than wall-clock time or something, but still kinda funny. :-)
It is not random. It is consistently picking the same document on all servers that have seen the same changes. By default it picks the one with the most changes. That consistency (the "same" part is very important) means that if you replicate and bring in some conflicts, both sides will show the same state. So after replicating A to B and B to A, you won't randomly see document 1 as the winner on A but document 2 on B. They'll both pick 1 or 2, so both settle on the same state.
Also, it doesn't delete or remove conflicting siblings; it is very good about not doing that to user data. Only the users know exactly how to resolve particular conflicts.
> How is, "whoever hits the queue most often" a useful deterministic resolution strategy?
It's a deterministic resolution strategy, and is thus useful.
> I guess it's functionally no worse than wall-clock time or something,
Wall-clock time is not deterministic; therefore it's far worse.
When dealing with distributed systems, deterministic processes are critical. Multiple systems all being right is awesome, but multiple systems being wrong in different ways is a nightmare. :)
Is it deterministic in a way that's useful? From the perspective of the end user, it's going to appear random, because they don't control the system environment in which "highest revision number" could mean something useful to them. In fact, the CouchDB guide seems to allude to this when it talks about not relying on this scheme for complex conflict resolution.
Two nodes, A and B, split. Say there are 100 updates to A and 500 updates to B. The split heals, and the system picks B because 500 > 100, but the write you actually want to dominate is on A. The user can't control which replica gets hit more often, or when a split happens, so while this might be deterministic inside the DB, it is semantically random from the user's perspective. So the system can make the same choice on all replicas, assuming it can guarantee it has seen all replicas, which I guess allows you to push merge management to each replica instead of requiring an intermediate coordination replica and then re-publishing the merged state to the replicas? So there's a system optimization benefit there.
But consider if the system did pick a winner at random, how would this look any different to the user? The user doesn't necessarily know if A or B should be picked.
Deterministic behavior is really important, but it seems like it really only looks non-random to the end user when deterministically picking a least upper bound for converging a join-semilattice, or when all operations on the data are commutative or idempotent, doesn't it?
Yes, because it makes it possible to maintain a consistent state across distributed nodes.
> From the perspective of the end-user its going to appear random
...but consistent. If every node picks a random revision on conflict, then when multiple clients try to continue editing, they'll end up increasing the conflicts.
> The split heals, the system picks B because 500 > 100, but the write you actually want to dominate is A. The user can't control which replica gets hit more often, or when a split happens, so while this might be deterministic inside the DB it is semantically random from the user's perspective.
Yeah, but what happens if B picks B, and A picks A? Now the write you're looking for is either there or not there, depending on which node you're talking to.
> how would this look any different to the user?
Everything is going to look random to the user, no?
It is only that random if the systems remain isolated for a long time. Since CouchDB requires you to send the last revision number when updating a document, if the systems are live-replicating between themselves, the guy who is hitting the queue more rapidly will be forced to fetch the latest winning revision every time before hitting the queue (and that may be a revision from a different guy). This gives him time to think about the revision he just received, perhaps examine the document linked to that revision number, see if everything is in place, perhaps merge changes himself manually... all that before updating the document in the database.
I don't see how a deterministic approach could do much better than this. The recommendation is always that the developer should implement a saner way to resolve the conflicts.
In the CouchDB world, however, I have the impression that conflict resolution is ignored most of the time, so we are left with this.
(I say this based on what I do, other people's code I read on the internet, and the concerns of the CouchDB core developers about educating users and developers to set up saner conflict resolution approaches themselves.)
I don't think so. Though "most recently changed" is pretty useless too. Clocks won't be synchronized in a distributed system, and even on a single machine that's set up to use something like NTP, time won't be monotonically increasing, since the clock-sync mechanism can move time both forward and backward.
"Each revision includes a list of previous revisions. The revision with the longest revision history list becomes the winning revision. If they are the same, the _rev values are compared in ASCII sort order, and the highest wins. So, in our example, 2-de0ea16f8621cbac506d23a0fbbde08a beats 2-7c971bb974251ae8541b8fe045964219."
But isn't `<database>/<dataset>` more or less similar to `<database>::<dataset>`? The only difference is the choice of a delimiter to disambiguate between a database and a dataset. For me, the first scheme is much more familiar.
You're just trading one arbitrary thing for another, IMO, but what's worse is you are now abusing the URL specification for the HTTP(S) protocol, so nobody can use existing HTTP URL libraries.
You could easily say everything before either ? or ; always refers to a database, and use a query parameter or a semicolon to delineate a dataset. Or you could use resource paths:
Glancing over RFC3986 [1], fragment identifiers seem to be pretty much made for what you're trying to communicate with :: - separating a subresource (the dataset) from a primary resource (the database). Unless I'm misunderstanding something?
There are issues with using `:` in a URL, if you plan on using the URL in a way that's compatible with the extant software out there. Two cases I remember:
- The Rails community tried to use `;`, which broke Mongrel 1. Mongrel's parser was generated from the RFC. There was a huge flame war about that back in the day; the Rails core team at the time thought that Mongrel should make an exception for a reserved character. (And after all was said and done, it got changed back to `/` for that particular use case.)
- When working on IPv6 support about 3 years ago, one of the things I added to an open source Ruby project was IPv6 literals in URLs. This was a case of using `:`. Even though this was defined in the RFC specifying the literal, I found out at the time that the Ruby standard library was written in a way that assumes you would never have `:` in a URL other than to delimit the port. I ended up having to do some workarounds for that.
That's just Ruby. I wouldn't be surprised if many other extant URL-parsing libraries break as well -- at least unless those characters are escaped.
You don't NEED ":". You NEED some sort of delimiter that can clearly distinguish between database and dataset; you happen to pick ':' to satisfy that. There might be a different delimiter that works better.
The other option is to not pretend it is a URL and call it something else.
Post-script: I think this project is a great idea. I'm looking forward to seeing how it turns out.
But as I mentioned in my previous reply, there may be unintended consequences. If this is something you guys want to do (and you want HTTP/HTTPS URL compatibility), check it out on different languages/platforms and see if your scheme breaks things. (And definitely see whether Windows libraries assume this; Windows file paths use `:` as a reserved character.)
Thanks for the help everyone with this most important aspect of the system ;).
To clarify, we don't think of these specs as URLs. The part before the final double colon is a URL. To parse one, you find the final double colon and take everything to the left of it as a URL.
> To clarify, we don't think of these specs as URLs.
But everyone else will because you are including the protocol, and at the end of the day, they are a uniform way of identifying a resource, so they are functionally URIs.
SQLAlchemy and most DB URIs are good examples of how to do this. For example, you can connect to a MySQL database instance and give it a default namespace/schema/database.
Part of the issue here is the ambiguity between a database, a database instance/server/host, a dataset/table, a catalog/namespace/schema, and what all those words and concepts mean. There's little consensus across fields, because even if computer scientists say "Okay, this is what a dataset actually is", somebody, whether it's a biologist or a physicist, will throw up their arms in protest.
Isn't the append-only design unsuitable for scenarios where many updates/deletes are made? If you update/delete 1GB of your 2GB database each day, then after a year the database is 365GB in size, but the live data is only 2GB.
I think the git-like features (history, merging) are very helpful for internal work, but when the dataset must be published, I think in most cases only the newest snapshot should be made available. But then the question is what format should it have...?
It just depends on the details. If you have a dataset in which 50% of the values change every day, and it doesn't compress well, then yeah, your Noms archive of that entire dataset is going to grow quickly.
In such situations, you could either (eventually, when it is implemented) prune old data, or aggregate the changes into bigger blocks.
Strawman marketing alert: "The most common way to share data today is to post CSV files on a website". Maybe there are a bunch of people that still do that somewhere, but if so, they ain't early adopters of decentralized database technology and so not your target customers. It's always better to talk about what your most likely customers are doing now.
And this isn't that surprising: it's human-readable, it gets the job done, and zipping gives decent compression.
I'm not sure who you think the target market for this would be, but I'm sure that if it's an efficient local format, you could probably get the ML crowd on board.
Right. A shocking amount of public data is distributed this way.
Also, we routinely talk to developers who complain about the difficulty of consuming data snapshots from partners, parsing it, trying to understand how it has changed since last time, etc.
With high value datasets, people frequently build an API to combat these problems. But it's hard to design a good API, and even if you succeed, it has to be secured, documented, scaled, and maintained indefinitely.
It'd be entirely up to the distributor of the data, so perhaps the answer to your question is "all of the above".
For example, (1) our command line tools use URL-like paths which implies "use this hostname" (to copy-paste into terminal), (2) we have some in-browser visualisations like http://splore.noms.io/?db=http://demo.noms.io/cli-tour which implies more of a "click here" type UI.
Hmm, if you want people to be able to link to Noms datasets on the web, maybe you should switch to using URLs to name the datasets, instead of a two-part identifier with a URL separated from a dataset name by a "::"? Darcs and Git seem to get by more or less with URLs and relative URLs; do you think that could work for Noms too?
The super REST harmonious way to do this would be to define a new media-type for Noms databases with a smallish document that links to the component parts. Like torrent files, but using URLs (maybe relative URLs) instead of SHA1 hashes for the components, maybe?
This is a good point. We never thought of these strings as URLs, but there are places where it would be nice to use them that only want URLs (the href attribute, for example).
The way we have it now is nice in that any valid URL can be used to locate a database. I am loathe to restrict that.
The hash portion of a URL is not transmitted to the server by browsers, so it wouldn't help in the case of putting the string into a URL bar or a hyperlink.
If the resource you're linking to is a database (or, to speak more strictly, if its only representation is a resource of a noms-database media type), rather than an HTML page or something, can't the browser be configured to pass it off to a Noms implementation, complete with the dataset identifier within? I mean, that's what people do with page numbers in PDF files, right?
Not only public data. At my main project I'm testing systems that generally crunch data from various sources, and yes, most of them are in CSV format; we then process them only slightly (some filtering, aggregation, translation) and spit other CSVs out. I was amazed that the company had not bothered to create a more... civilized (?) solution for internal data processing - but I guess that since it works, there's no drive to change it on a whim.
CSV for shoving files around (or TSV or whatever similar thing) is great because it generally just works. I can throw it into virtually any language or system, open and read it myself, grep it, check it with any platform. I can often get away with just looking at the files and absolutely nothing else, though a data dictionary is hugely appreciated.
I don't need to make sure I've got postgres 9.5 setup with a particular user account & set configs for the password, start ES (but not version Y because of a feature change) on port Z, etc. I don't need to manage making sure the two branches I'm looking at don't overlap or try to write to the same database. Keeping multiple results and comparing their output can be easily done as they're just files to be moved. Small tasks that read a file and spit out another can be checkpointed just by making them look to see if the file they expect to create already exists.
I'm hugely in favour of CSV for external data too. Sure, provide other options as well, but I love that the "get all the data" command can be as simple as a curl command. I don't want to read your API docs and build something custom that tries to grab everything, I don't want to iterate over 2M pages, I don't want to deal with timeouts, rate limits, etc. Just give me a URL with a compressed CSV file.
All the problems that come along with it, for me, are related to poor data management which I doubt a format change would fix.
Maybe CSV isn't the best internally, but for a vast amount of cases it's nearly the best and gives you a lot of flexibility. My general advice would be to start with CSV unless you've got a good reason not to, and then try and move to a different line based thing (jsonl, messagepack?). It is highly unlikely to be the biggest problem you have with your data, and the time spent putting it into a more "sane" format is often (in my experience) better spent on QA and analysis of the data itself.
I'd say the current problem is that lots of data is available only either in excel files, pdfs, and APIs pointing to a possibly constantly changing data store.
Is it possible that their target market is not current users of decentralized DBs?
At first glance, it strikes me as a solution for people storing something like scientific data sets rather than application data. In which case, posting CSV files on a website is pretty much best-case scenario.
EDIT: Although, in the "scientific data set" scenario, I'm not sure how much value there would be in storing version history.
Right, this looks like it has far more immediate value for storing data which will be collaboratively mutated, so: a company directory, a knowledgebase, a CRM datastore...
For large scale datasets, I'd be looking at GIS data maybe.
After more reading, it kind of sounds like a better CouchDB?
And cool if so, but the CSV analogy is really confusing.
When I think of CSV, I think of a ghetto data exchange format that I can send to or receive from a less technical person. As I understand Noms, it does not sound like it's for non-technical people.
At first glance, this reminds me a little bit of datomic - all data history is preserved/deduplicated, fork/decentralization features. Can you comment on how it compares?
I feel weird speaking for them, but at a product level, I think it's fair to characterize Datomic as an application database -- competing with things like mongo, mysql, rethink, etc.
While Noms might be a good fit for certain kinds of application databases (cases where history, or sync, is really important), we're really focused more on archival, version control, and moving data between systems than on being an online transactional database.
Also, at a technical level, unless I'm wildly mistaken, I don't believe that Datomic is content-addressed, and I wouldn't call it "decentralized" (though that word is a bit squishy).
Dat is (currently) focused on synchronizing files in a peer-to-peer network.
Noms can store files, but it is much more focused on structured data. You put individual values (numbers, strings, rows, structs, etc) into noms, using a type system that noms defines, and this allows you to query, diff, and efficiently update that data.
Also Noms isn't peer-to-peer (although we hypothesize that it could run reasonably on top of an existing network like IPFS).
Dat has changed so much, and there has been so much hype and tooling (probably now broken) around it, and yet it doesn't seem to be delivering anything, nor does there seem to be much data anyone is willing to publish with it.
We were also heavily influenced by camlistore (which I hacked on for a while), irmin, ipfs, and others who have done a lot of interesting work in this space.
Noms should be useful as a backup utility, but I'd say it's especially useful for backing up data which is not files. Think about backing up data which you only have access to via API.
You can take the JSON output of an API and drop it into Noms, then do the same thing tomorrow, and Noms will automatically deduplicate the data as well as give you a nice structured API to read and interact with it.
Another question to help me better understand the tool: who do you see as your competitors, technically? What do you see as viable alternatives to Noms, but worse? (Or better!) I do see that you were inspired by git, but clearly your use-cases are different.
Git is a competitor. It is fairly common to check data (e.g., csv or json files) into Git today.
However, this falls down pretty rapidly. In order to get reasonable diffs, the data has to be sorted, and line-oriented. Also Git just doesn't scale well to larger repos or individual objects.
Otherwise, we see the competitors as the way that people distribute data today - custom APIs, zip files full of CSV, etc.
How would you say Noms compares to Datomic[1]? Both projects are working on the same idea of representing a database as tree of commits over time.
From my quick inspection, it looks like Noms shows some focus towards working in multiple branches, whereas Datomic, at least in its marketing materials, just talks about preserving a single timeline.
Why would you exclusively support schema inference, rather than also allowing users to manually specify their schemas?
Schema inference is very difficult to do correctly and safely, especially with small initial samples of instances (source: work on https://github.com/snowplow/schema-guru).
There's a little bit of terminology overloading going on here.
In Noms every value has a type. It's an immutable system, so this type just is. The type of `42` is `Number`. The type of `"foobar"` is `String`. The type of `[42,44]` is `List<Number>`. And if you add "foo" to that list, the type becomes `List<Number|String>`.
We don't try to infer a general database schema from a few instances of data. We just apply this aggregation up the tree and report the result.
That all said, we do want to eventually add schema _validation_, by which I mean the ability to associate a type with a dataset and have the database enforce that any value committed to the dataset is compatible with that type (following subtyping rules).
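A rough sketch of the aggregation described above, using ordinary Go values and hypothetical helper functions rather than the real Noms type system, just to show how a list's type widens into a union as new element types appear:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// typeOf returns a primitive type name for a leaf value.
func typeOf(v interface{}) string {
	switch v.(type) {
	case float64, int:
		return "Number"
	case string:
		return "String"
	}
	return "Unknown"
}

// listType aggregates the element types it has seen into List<T> or List<T1|T2|...>.
func listType(elems []interface{}) string {
	set := map[string]bool{}
	for _, e := range elems {
		set[typeOf(e)] = true
	}
	names := make([]string, 0, len(set))
	for n := range set {
		names = append(names, n)
	}
	sort.Strings(names)
	return "List<" + strings.Join(names, "|") + ">"
}

func main() {
	l := []interface{}{42, 44}
	fmt.Println(listType(l)) // List<Number>

	l = append(l, "foo")
	fmt.Println(listType(l)) // List<Number|String>
}
```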
It sounds like Noms is dynamically typed, rather like SQLite — types are associated with values, not (just) with datasets. The difference is that SQLite (like Python or JS) only types leaf/atomic data, while you're also typing aggregate data. Is that right?
Right - the challenge is that with dynamic typing and without schema validation, it's incredibly easy to break any strongly typed client/consuming application. You think you are dealing with a `List<Number>`, you have Go/Java/Haskell/whatever apps which are consuming that in a strongly typed fashion using their idiomatic record types, and then suddenly a user accidentally sends in a single value which turns the tree of values into a `List<Number|String>`, and all your consuming apps break.
Given that schema validation ("does this instance match this type?") is simpler to implement than schema inference ("what is the type of this instance?"), it's surprising to me to deliver inference first...
I don't understand how we could have gone in the opposite direction.
Schema validation for us is just looking at the type requirements of the dataset and the type of the value and seeing if they are compatible. How can we do that without first knowing the type of the value?
IPFS is essentially providing a globally decentralized filesystem. Noms is providing (or hopes to provide) a database.
By database, I mean:
- small individual records
- efficient queries, updates, and range scans
- ability to support complex queries
- ability to enforce structural data validity
These are all things that IPFS could eventually grow to support, but in order to do it, I think it would have to grow into or layer something like noms on top.
How would Noms work for massive amounts of geo-temporal data, with many inserts and queries but few (or zero) updates? Efficient queries on geo-temporal keys are useful.
I mean, by their definition it is a database, but I can understand your usage. Then again, they both say it is a database, and in your link, they say it "isn't quite there yet", so /shrug heh.
The HN title suggested it's a database, which made me really curious as I can finally stop using history tables (or wal logging, or the other myriad ways of seeing a point in time). However, that doesn't seem to be the case here?
That said, the idea of "git as a datastore" does seem akin to "blockchain as data verification". Combine those two ideas together, get PWC involved and you have multimillion dollar deals coming in for audit protection.
I've been working on something pretty akin to what you describe, hosted verifiable data structures (logs and maps). Rather than a blockchain, it uses the same data structures as Certificate Transparency to provide equivalent functionality. Would love to get some feedback if you have the time to look: https://www.continusec.com/
I've been wanting something like Noms for a while. Prolly trees sound really promising.
In intro.md, you suggest, "If you wanted to find all the people of a particular age AND having a particular hair color, you could construct a second map having type Map<String, Set<Person>>, and intersect the two sets." In that case, how should I keep the two maps in sync? Do I need to atomically update the logic of all the instances of the application to modify both maps instead of just one? Or do I keep the second map (the hair color index) in a separate index database and update the index whenever I pull changes from a remote database? (What does the API look like for getting notified of new changes that haven't been indexed yet?)
I see that "noms sync" does both push and pull. Does that mean I can't pull data from a database I can't write to? How does that work over HTTP — do I need to use a special HTTP server that knows how to accept and authenticate write requests, or can I just dump a Noms dataset in a directory and serve it up with Apache?
Forgive me if these questions are obvious — I've read the docs I could find, but I haven't read any of the code beyond the hr sample.
> Do I need to atomically update the logic of all the instances of the application to modify both maps instead of just one? Or do I keep the second map (the hair color index) in a separate index database and update the index whenever I pull changes from a remote database? (What does the API look like for getting notified of new changes that haven't been indexed yet?)
Currently, you have to manually keep an index up to date. But keep in mind that internally this is what all databases are doing -- manually reflecting changes into indexes -- they just hide it from you.
Eventually, we imagine that there will be tools to declare indexes you want to maintain and we'd do it for you. Note that because Noms is good at diffing, calculating the changes that need to be re-indexed comes for free!
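For what it's worth, here's a minimal sketch of what that manual index maintenance looks like, using plain Go maps in place of Noms values (the Person shape and helper names are made up for illustration): every write updates both maps, and the "age AND hair color" query from intro.md is just a set intersection.

```go
package main

import "fmt"

// Person and the two index maps below are made-up stand-ins for the
// Map<Number, Set<Person>> / Map<String, Set<Person>> shapes in intro.md.
type Person struct {
	Name string
	Hair string
	Age  int
}

type set map[string]Person // keyed by name, just to keep the sketch small

var byAge = map[int]set{}
var byHair = map[string]set{}

// addPerson is the "manual index maintenance": every write touches both maps.
func addPerson(p Person) {
	if byAge[p.Age] == nil {
		byAge[p.Age] = set{}
	}
	if byHair[p.Hair] == nil {
		byHair[p.Hair] = set{}
	}
	byAge[p.Age][p.Name] = p
	byHair[p.Hair][p.Name] = p
}

// intersect answers the "age AND hair color" query from the two indexes.
func intersect(a, b set) []Person {
	var out []Person
	for name, p := range a {
		if _, ok := b[name]; ok {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	addPerson(Person{"Ada", "brown", 36})
	addPerson(Person{"Bob", "brown", 41})
	addPerson(Person{"Cat", "red", 36})
	fmt.Println(intersect(byAge[36], byHair["brown"])) // [{Ada brown 36}]
}
```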
I'm surprised you didn't use a functional language like Haskell or OCaml or Rust to do this, since the article talks about love for functional programming.
I'm not criticizing Go at all, it's just not really a functional language.
Git's author felt the alternative (gratis) systems were lacking. Noms's author, on the contrary, praises Git but doesn't build on it. He chooses to implement the same technology himself, and from the docs it's not clear to me why that is.
Off the top of my head: Git can only use SHA-1, which makes it unsuitable for any use case where you need to cryptographically verify the origin of data (so far nobody has been able to tell me definitively how secure Git's signed commits and tags really are).
Assuming SHA1 has second pre-image resistance (which it currently still does), the security of git signed commits/tags is the same thing as the security of the private key used to sign the commits/tags.
This is really interesting! What are some ideal use cases for the current implementation? I've seen that Git is considered a competitor, but Noms also appears to be a generic database, so I would just like to hear some basic use cases, if possible.
Eg: If used as a database, what applications would benefit from Noms? Could/should this be used for personal storage? Could/should this be used for code versioning (ie, Git)?
Firstly, it would be cool if this could be a single gateway to "all the data in the world". Right now it's a pain to find, say, energy generation statistics for, say, Portugal, but it would be great if I could do something like:
noms get statistics.industry.energy.portugal.all();
Secondly, the versioning idea could have some really cool applications. For example, I work in data analytics, and sometimes I want to transform some data in an SQL table.
Doing transformations nicely is a bit difficult. Either I'm doing the calculations in a column of a view, with the associated performance hit, or I'm tacking columns onto the table, which quickly leads to a mess, especially during the initial stages of analyses.
It would be so cool if I could treat the database as a constantly-evolving git tree.
I really like the idea in theory, but seeing it in practice I feel the whole thing is too concerned with being a wrapper around git handling for their dataset files.
I would much rather see diffs based around the records themselves, and not so much the structure of the data.
While Git is referenced as an inspiration, the implementation of Noms does not use Git. Noms performs diffs on the data - as records or whatever other structure you used in importing your data. CSV is but one example of a way to import data into Noms, but since so much data is available in that format it is an easy one to reference that most people know. Noms can also import JSON, XML and many other data types if you are willing to write JS or Go code (more to come). Thanks for taking a look at Noms!
You can, and people do that today. It has limitations though:
* The data must be sorted in order for Git to provide good diffs
* It does not scale very well. On my machine, Git refuses to diff files over 1GB (maybe there is a setting for that)
* You must clone the entire repository onto your machine to work with it
* There is no programmatic API -- you must work with the data and changes as text and line diffs
We've been struggling to manage a collection of periodically updated CSVs & binaries a few GBs in size. We struggled with Git-LFS and gave up, and we were considering (dreading) SVN, so this looks really promising. Cheers!
Can you elaborate a bit on how the hashing and chunking works?
There's a rolling hash for determining chunk boundaries, and also SHA-512/256 somewhere.
Does the same data chunked differently have a different hash?
Briefly, there are two main hash functions in use in noms.
sha-2 is used to compute the hash of individual chunks. This is the classic use of hashing in content-addressed systems.
We also use a rolling hash to compute chunk boundaries. We do this in the typical way that tools like bup, camlistore, rsync, and others do for large files.
But our observation was that if you squint your eyes, a merkle tree looks a little like a b-tree. So we use a rolling hash to break up huge lists, maps, and sets into trees where nodes are roughly 4KB. So it's a kind of self-balancing, probabilistic, deterministic b-tree thing.
We never chunk the same data differently. An inviolable rule of Noms is that the same logical value is always chunked the same way and always has the same hash.
If I start with integers 1-1000000 and you start with integers 0-999999, and we both make mutations to converge at the same list, we will end up with the exact same tree, with the exact same hashes.
This is what makes efficient synchronization and diff of noms data possible.
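For readers unfamiliar with the technique, here's a minimal sketch of bup/rsync-style content-defined chunking in Go (a buzhash-like rolling hash; this is an illustration of the idea, not Noms' actual rolling-hash code): boundaries fall wherever the hash of the trailing window matches a pattern chosen to give roughly 4KB average chunks, so identical byte streams always split at identical offsets.

```go
package main

import (
	"fmt"
	"math/bits"
)

const (
	windowSize = 64                // bytes of trailing context that decide a boundary
	boundary   = uint64(1)<<12 - 1 // ~1 window in 4096 matches -> ~4KB average chunks
)

// mix64 is a splitmix64-style finalizer used to fill the byte table with
// well-scattered values.
func mix64(x uint64) uint64 {
	x += 0x9e3779b97f4a7c15
	x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9
	x = (x ^ (x >> 27)) * 0x94d049bb133111eb
	return x ^ (x >> 31)
}

var table [256]uint64

func init() {
	for i := range table {
		table[i] = mix64(uint64(i))
	}
}

// chunk splits data wherever the rolling hash of the trailing windowSize bytes
// hits the boundary pattern. The cut points depend on local content, so the
// same byte stream always produces the same chunks (and hence the same hashes).
func chunk(data []byte) [][]byte {
	var chunks [][]byte
	var h uint64
	start := 0
	for i, b := range data {
		h = bits.RotateLeft64(h, 1) ^ table[b]
		if i >= windowSize {
			// Drop the byte that just slid out of the window (rotation is mod 64).
			h ^= bits.RotateLeft64(table[data[i-windowSize]], windowSize%64)
		}
		if i-start+1 >= windowSize && h&boundary == boundary {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

func main() {
	// Deterministic pseudo-random input; the printed chunk sizes cluster around 4KB.
	data := make([]byte, 64*1024)
	for i := range data {
		data[i] = byte(mix64(uint64(i)))
	}
	for _, c := range chunk(data) {
		fmt.Println(len(c))
	}
}
```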
Thanks. So it's building a hash tree with deterministic chunking and that means you can cheaply update the hash after updating parts of the tree as you only have to rehash certain bits?
Does that mean that your chunk sizes are kind of fixed? Do you think there's a way to retain that advantage and be able to coalesce smaller chunks into larger ones?
Say your smallest nodes are 4KB but for more efficient storage you might want to go up to 4MB chunks.
Could that be done while retaining the same hash for the same underlying data?
Keep in mind (if this wasn't clear) that the chunks are only probabilistically 4K: https://github.com/attic-labs/noms/blob/master/go/types/roll.... I.e. the thing that's "fixed" here is the chunk size we're aiming for. The chunks themselves could be of any size.
In any case, that's a good question - we might want to do something about that down the line. But, if we did change that constant, the structure of the trees will change, and all[1] the hashes will change.
This looks really interesting. I've been thinking about the problem of distributed issue tracking lately... and the set of sub-problems it has (authorization and authentication, synchronization, and so on)... I'm not sure all of these problems could be covered by this, but I guess at least the "distributed" part could be covered by something like it.
I had an idea for this with a buddy in college after doing case study research into Git. I've always considered this the next step into a decentralized world outside of code and non-typed "text". I know CSVs were mentioned a few times; are you looking to narrow in on a few specific file types for a proof of concept?
We envision there to be tools that work on certain data types (Noms has a full type system), for example an app that displays all geo locations in a dataset.
This is really interesting, thanks for sharing it!
I haven't had a chance to dig into the code yet, but I notice that you say two replicas of the same database can be disconnected, altered, and then merged. Could you explain how Noms takes care of that, particularly in the case of collisions?
We assume that within a given version of the database format, there will never be a collision. The chances of a SHA-2 collision are beyond astronomical, and if you can create one, there are better things to do with your time than bother us.
That said, hashes only get weaker over time. The chances of an md5 collision used to be astronomical, now they are not.
So it was important to us to have an escape hatch - a way to increase the strength of the hash we use over time.
That's why we built a format version into Noms from the beginning. Our design is predicated on the fact that within a given version of the format, there is a 1:1 correspondence between hashes and values. Every value has exactly one hash, and every hash encodes exactly one value.
In future versions of the format, we might change the hash function. In this situation, we'd need to import data from the old format to the new format, just like how you have to sometimes migrate traditional databases across versions.
First off... I'm excited to see this project. There's a lot of potential here and this looks like a good implementation of a nice concept. I have at least a bit of authority behind that statement, since a few years ago, I had the opportunity to build something similar (although smaller in ambition.) A couple things to think about:
* Type accretion - This doesn't change the fact that database clients need to be able to accept historical data formats if they need to access historical data. The schema can't be changed for the older data objects without changing the hashes for that data, so there's no way to do anything like the schema migrations you'd do in SQL. For simple schema changes like adding fields, this might not be so hard to deal with, but some changes will be structural in nature and change the relative paths between objects. (This adds complexity to the code of database clients, as well as testing effort.)
* Security - Is there a way to secure objects stored within noms? Let's say I store $SECRET into noms and get back a hash. Does it then become the case that every user with access to the database and the hash can now retrieve the $SECRET? What if permissions need to be granted or revoked to a particular object after it's been stored? A field within a particular object? What if an object shouldn't have been stored in the database at all and needs to be obliterated? (This last problem gets worse if the object to be obliterated contains the only path to data that needs to be retained.)
* Performance - The CAS model effectively takes the stored data, runs it through a blender, and returns you a grey goo of hashes...this is good for replication, but it means you can't get much meaningful information out of a hash. This tends to mean a lot of operations like you might find in an old-school navigational database, and a huge dependency on the time to fetch an object given a hash. Indices can help by reducing the complexity of the traversals you need to do, but only if they're current and you have the index you need.
* Data roll-off - How do you expire data so that it doesn't just monotonically increase in volume? Say there's an API to mark an object as purgeable; the problem of identifying other purgeable objects then turns into effectively a garbage collection process. (git gc, etc.) There's also the issue of the sheer number of objects that can be involved. The system I was involved with had something like 500K objects/day that had to be purged after 120 days in the system. (A total of 60MM objects online and around 6TB or so.) Identifying 500K objects to purge and then specifying those to the data layer for action is not necessarily an easy thing....
* Querying - Server side query logic (and an expression language) is basically essential to performance. Otherwise, you wind up with a network round trip for every edge of the graph you follow. Going back to my first point, whatever querying language is used has to be flexible enough to handle a schema that might be varying over time (through schema accretion).
All four of these bullet points are worthy of a great deal more discussion, and I haven't even broached issues around conflict resolution, differencing, UI concerns, etc. I think there are good approaches to managing lots of these issues, but there's a bunch of engineering involved, as well as some close attention to scope and goals...
- Type accretion: I don't think in general that schema changes like what happens in sql databases works very well (I say this having worked on such systems). In big systems, it's hard to get everyone to agree on a moment to CHANGE THE SCHEMA. You can certainly do something like that in Noms -- just write a new dataset and replace the old one. But being able to read old data and leave old clients working I think is powerful. Couple this with the structural typing that falls naturally out of Noms and - I think - you have a more flexible way to change schemas over time.
- Perf: I'm not really following you here. CAS has some positives and some negatives for performance.
- expiration: 1. There are a huge number of systems today that never delete data. Taking advantage of that to make other operations faster makes sense. 2. Yeah, it's a GC problem; luckily GC is a well-studied problem. Also, since Noms is a merkle tree and merkle trees are good at diff, we have some additional leverage. We don't need to do a full scan every time.
- querying: disagree that it is essential to perf. Another option is to have a schema that matches your access model. You can do that server-side in addition (or instead) of having a query language.
===
It sounds like you have thought a lot about all of this! If you are interested, your brain would be very appreciated in the github or slack.
> It sounds like you have thought a lot about all of this!
Up until around 2014, I was heavily involved in the construction of a small CAS (100MM objects online, around 5-6TB in size) for a client that needed to replicate certain periodic calculations in a reliable way. It worked well, but something like noms would have eliminated the need for a bunch of custom work.
> If you are interested, your brain would be very appreciated in the github or slack.