
I work for a digital archives project and we're concerned about similar issues of provenance. Our way of dealing with this has been to structure our archives as JSON-formatted text in Git repositories, with binaries managed by git-annex.

Git uses hashes for everything: files are stored in .git/objects/ under their content hash, each commit records the hash of a tree describing the working directory's contents, and each commit points to the hash(es) of its parent(s).
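As a sketch of how little magic is involved: a blob's object ID is just the SHA-1 of a short header plus the file's raw bytes (assuming the classic SHA-1 object format, not the newer SHA-256 repositories), so you can reproduce it outside Git entirely:

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Compute the object ID Git assigns to a file's contents (a 'blob')."""
    # Git hashes a header ("blob <size in bytes>\0") followed by the raw bytes.
    header = f"blob {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()

# Matches what `printf 'hello\n' | git hash-object --stdin` reports.
print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

Trees and commits are hashed the same way, which is why a single commit hash pins down the entire history beneath it.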

Using Git it's possible to verify the integrity of the entire repository and its history: any tampering changes the hashes downstream of it and so leaves traces. It's also possible to keep multiple copies (clones) of the data and verify that they are exactly the same.
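A minimal sketch of both checks with stock git commands (the repo paths here are made up for illustration): `git fsck --full` re-hashes every object to verify internal integrity, and comparing the tip commit hashes of two clones verifies they hold identical history.

```shell
# Assumes git is installed; "archive" and "mirror" are hypothetical paths.
set -eu
work=$(mktemp -d)
git init -q "$work/archive"
git -C "$work/archive" -c user.name=archivist -c user.email=a@example.com \
    commit -q --allow-empty -m "initial record"

# Walk every object and check its content still matches its hash.
git -C "$work/archive" fsck --full

# Identical history implies identical tip hash, so one comparison suffices.
git clone -q "$work/archive" "$work/mirror"
[ "$(git -C "$work/archive" rev-parse HEAD)" = \
  "$(git -C "$work/mirror" rev-parse HEAD)" ] && echo "clones match"
```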

If you have a copy of such a collection, you can compare it against a copy held by the originating institution. If that becomes impossible for some reason, and you can track down any of the original files, you can prove that that portion of the data is correct, which lends trust to the rest of the collection.

Of course, it's possible this would not convince a die-hard trump supporter. /s


