> Those of us who benefit from tools like iNat should be looking seriously to the decentralized models being developed by the likes of Bluesky and Mastodon, because we can’t rely on any single organization to provide that benefit forever.
Makes perfect sense to me, and I would like to point you to the technology and ecosystem of "nanopublications": https://nanopub.net/
In a nutshell, nanopublications provide a decentralized infrastructure like Mastodon, but with a focus on redundantly storing open data rather than on user ownership of personal data. Moreover, nanopublications are basically snippets of knowledge graphs, so they resemble database entries and can be queried as such.
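To make the "snippets of knowledge graphs, queryable like database entries" point concrete, here is a minimal sketch of a nanopublication as a set of RDF quads. The assertion/provenance/pubinfo graph names follow the nanopublication model, but the statements, prefixes, and the `query` helper are made up for illustration:

```python
# A nanopublication sketched as quads: (subject, predicate, object, graph).
# The example URIs (ex:, np:, prov:, dct:) are illustrative placeholders.
nanopub = [
    ("ex:malaria", "ex:isTransmittedBy", "ex:anopheles", "np:assertion"),
    ("np:assertion", "prov:wasDerivedFrom", "ex:study42", "np:provenance"),
    ("np:pub", "dct:created", "2014-07-01", "np:pubinfo"),
]

def query(quads, predicate=None, graph=None):
    """Filter quads like a database query, on predicate and/or graph name."""
    return [q for q in quads
            if (predicate is None or q[1] == predicate)
            and (graph is None or q[3] == graph)]

# All statements in the assertion graph:
assertions = query(nanopub, graph="np:assertion")
```

Because each quad carries its graph name, the scientific claim, its provenance, and its publication metadata stay separable while living in one publication.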
Thanks for your comments. First off: yes, most (perhaps all) of the applied methods are not novel, some of them have been around for a long time. We only claim novelty on how these existing methods are combined to solve the problem of data availability and integrity on the web.
Yes, the magnet URI scheme is closely related, and we probably should have referred to it in one way or another. However, there are crucial features that magnet links do not provide (as far as I know): you cannot generate a hash that represents content on a more abstract level than byte sequences (MIME types by themselves don't solve that problem), and you cannot have self-references. All of the features from our list of requirements are supported by some approach, but (to our knowledge) no approach supports all of them at the same time.
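The difference between byte-level and more abstract hashing can be shown in a few lines. This is only a toy sketch (line-based "parsing" and lexicographic sorting, not the actual trusty URI algorithm): two serializations of the same statements, differing only in statement order, get different byte-level hashes but the same content-level hash.

```python
import hashlib

# Two serializations of the same two RDF-like statements, in different order.
doc_a = "ex:a ex:p ex:b .\nex:a ex:q ex:c ."
doc_b = "ex:a ex:q ex:c .\nex:a ex:p ex:b ."

def byte_hash(doc):
    """Byte-level hash, roughly what a magnet link identifies: order-sensitive."""
    return hashlib.sha256(doc.encode()).hexdigest()

def content_hash(doc):
    """Hash on a more abstract level: split into statements, normalize
    their order, then hash the canonical serialization."""
    statements = sorted(line.strip() for line in doc.splitlines() if line.strip())
    return hashlib.sha256("\n".join(statements).encode()).hexdigest()
```

Here `byte_hash(doc_a)` and `byte_hash(doc_b)` differ, while `content_hash` agrees on both, because the normalization step erases the irrelevant ordering before hashing.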
In terms of search engines caching research data, I agree! We shouldn't trust existing providers too much but build a dedicated decentralized infrastructure for scientific purposes (this is what I am working on now).
I am sure the performance measures can be improved (incremental cryptography might allow us to get rid of sorting altogether). The shape of the curve, however, is not much affected by whether the statements are already sorted or not (they are not sorted for TransformRdf and TransformLargeRdf!).
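The idea behind getting rid of sorting via incremental cryptography can be sketched as an order-insensitive combiner: hash each statement individually and combine the per-statement digests with a commutative operation. The function below is a hypothetical toy (additive combining modulo 2**256, in the spirit of incremental multiset hashing); a production scheme would need a proper security analysis, as naive combiners have known weaknesses.

```python
import hashlib

def multiset_hash(statements):
    """Order-insensitive digest: hash each statement separately, then
    combine with integer addition mod 2**256. Toy sketch only -- the
    combiner is NOT claimed to be cryptographically sound as-is."""
    acc = 0
    for s in statements:
        digest = int.from_bytes(hashlib.sha256(s.encode()).digest(), "big")
        acc = (acc + digest) % 2**256
    return acc
```

Since addition is commutative, `multiset_hash` gives the same result for any permutation of the input, so no O(n log n) sorting pass is needed before hashing.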
But I don't think I understand your concern about abstract hashing and why it would need to be something fundamentally new. Both the order normalization and the self-reference are simply preprocessing stages on your data, albeit in slightly different forms. The sortedness requirement, I think, is captured by MIME type parameters (like the "charset=" in "text/html;charset=UTF-8"), as it does not change the fact that the document is an RDF graph. For the placeholder trick, I think you're right that you'd want something like a "text/rdf+selfref" MIME type to indicate that it is not in fact valid RDF until preprocessing has been performed. All told, your RDF module would be described in MIME as something like "text/rdf+selfref;sorted=".
Right, I guess you could define everything in a new MIME type, but I think that would be quite a strange thing to do and wouldn't really be faithful to the idea of MIME types. This MIME type would not stand for a format anybody directly uses for files, but only for an internal intermediate representation (I will not be able to convince people using RDF to switch to my strange new format instead of TriG or N-Quads!). And that means there would be two MIME types involved for a single file: the actual type (such as application/rdf+xml or application/trig) and then the type used for normalization and hash calculation (something like "text/rdf+selfref;sorted="). I think this shows that MIME types are not a straightforward solution to the given problem, and I think this justifies introducing a new level and a new scheme for the trusty URI modules (e.g. "RA").
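The "placeholder trick" for self-references discussed above can be sketched in a few lines. This is a hypothetical simplification, not the actual trusty URI algorithm: the placeholder token `{{SELF}}`, the `hash:` prefix, and both helper functions are made up for illustration. The hash is computed over the document with the placeholder still in place, and verification undoes the substitution before re-hashing, so the self-reference does not invalidate its own hash.

```python
import hashlib

PLACEHOLDER = "{{SELF}}"  # hypothetical stand-in for the self-reference

def make_self_referencing(template):
    """Hash the template (placeholder included), then substitute the
    hash-based reference into the document."""
    digest = hashlib.sha256(template.encode()).hexdigest()
    return template.replace(PLACEHOLDER, "hash:" + digest), digest

def verify(doc, digest):
    """Replace the self-reference with the placeholder again, re-hash,
    and compare against the claimed digest."""
    template = doc.replace("hash:" + digest, PLACEHOLDER)
    return hashlib.sha256(template.encode()).hexdigest() == digest
```

Any change to the rest of the document changes the recomputed digest, so tampering is detected even though the document literally contains its own hash.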
What happens to other URLs embedded in the document that you link to with trusty URLs (other than self-references)?
For example, your document could include images, or JavaScript that could completely change the meaning of the document, while keeping the hash of the document the same.
Do you require all URLs contained in the document to be trusty URIs too?
No, there can be contained URIs that are not trusty (probably the majority of contained URIs will be of that type). You can verify the entire reference tree as long as you follow trusty links, but of course this cannot go on indefinitely. Furthermore, not all resources have the form of what I call a "digital artifact" (e.g. foaf:knows does not stand for a digital artifact), but reach out to the real world (these URIs might not even return a representation, i.e. they might not be URLs).
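Verifying a reference tree by following only trusty links can be sketched like this. Everything here is hypothetical and simplified: the `trusty:` scheme, the flat SHA-256 digest, and the in-memory `store` stand in for real trusty URIs and actual dereferencing over the web. Non-trusty URIs (like foaf:knows) are simply left unverified.

```python
import hashlib
import re

def trusty(content):
    """Hypothetical trusty URI: the scheme 'trusty:' plus the SHA-256
    of the content (the real scheme is more involved)."""
    return "trusty:" + hashlib.sha256(content.encode()).hexdigest()

def verify_tree(uri, store, seen=None):
    """Check the content behind uri against its hash, then recurse into
    any contained trusty URIs; other URIs are skipped."""
    seen = set() if seen is None else seen
    if uri in seen:                 # guard against reference cycles
        return True
    seen.add(uri)
    content = store[uri]
    if trusty(content) != uri:
        return False
    refs = re.findall(r"trusty:[0-9a-f]{64}", content)
    return all(verify_tree(ref, store, seen) for ref in refs)

leaf = "ex:b foaf:knows ex:c ."       # contains only non-trusty URIs
leaf_uri = trusty(leaf)
root = f"ex:a ex:cites {leaf_uri} ."  # cites the leaf via its trusty URI
root_uri = trusty(root)
store = {root_uri: root, leaf_uri: leaf}
```

With this setup, `verify_tree(root_uri, store)` succeeds, while tampering with the leaf's stored content makes the whole tree fail verification.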
Happy to elaborate if this is of interest.