It's not that many of the complaints aren't reasonable, but I thought that in general compression/format was orthogonal to parity, which is what I assume is actually wanted for long-term archiving? I always figured the goal should normally be to get back out a bit-perfect copy of whatever went in, using something like Parchive at the file level or ZFS at the fs level for online storage. I guess on the principle of layers and graceful failure modes it's better if even sub-archives can handle some level of corruption without total failure, and from a long-term perspective of implementation independence simpler/better-specified is preferable, but that still doesn't seem to substitute for just having enough parity built in to both notice corruption and fully recover from it, to fairly extreme levels.
I think with archiving it’s more than that. Sure you can guarantee that the actual tool you just compressed with can restore the original perfectly. But with long term digital archiving I think you need the assurance that the “spec” called “xz” could be perfectly reimplemented by an expert in the future. Based solely on documentation. And on a platform that doesn’t exist today. That is, you must assume the original executable is either not available or not able to be executed.
Why would they need to recreate it solely based on 'documentation'? It is open source; the source code is the documentation. The source is at least as likely to survive as some complete technical document would be. Maybe they wouldn't be able to compile it (though they probably would -- I don't see why they wouldn't have some kind of computer emulator available), but it's better than any other kind of documentation you could provide.
The src/ tree of xz is 335k (compressed with gzip). If you are worried future digital historians won't be able to figure out the xz format, throw a copy of the gzip'd source onto every drive you store archives on, it would basically be free and would almost guarantee they would have a complete copy of exactly what they would need to decompress the files.
You're exhibiting shortsightedness when it comes to "source". If I give you some RPG [1] or maybe some ALGOL 58 [2] source code are you going to just compile and run it no problem? How about some FLOW-MATIC [3]?
Yes, programming languages come and go, but I don't see how that matters. Some future historian will either have access to a working copy of xz or they will not. If they don't, and they want to implement it, having a copy of the source code is far better than anything else you could give them. Sure, future programming languages will be quite different, but humans will certainly be able to read and understand C code. If humanity has forgotten how to read C code (and lost all knowledge of it), how are they going to read this documentation you seem to prefer? Human languages come and go too.
Any argument you can make about historians being able to recover dead languages you can make the exact same argument for their ability to recover dead computer languages, and there is no better or more accurate specification than the actual code.
So let me add to my recommendation, in addition to a copy of the xz source code, include a plain text copy of any 'how to program in C' book, or just the wikipedia page for the C language. That is more than enough for them to construct a program that can decompress xz files, once they relearn how to read whatever long dead language the book is written in (Ancient Pre-Cataclysm Earth English for example).
> If humanity has forgotten how to read C code (and lost all knowledge of it), how are they going to read this documentation you seem to prefer?
Sure, but are they going to remember things like weird precedence rules (see: `&`), undefined behaviour, etc.? Just because they want to reimplement one specific, small program does not mean they want to relearn several languages. What you're describing could easily blow up from 'how to code in C' to 'reading the GCC/Clang compiler source to figure out how a specific piece of UB was implemented, which this particular program happens to fall into', which I'm sure nobody wants to spend their weekend doing. Implementing something like `xz` could simply be a midpoint on the way to their real goal; they don't want to spend weeks digging up COBOL. Have at least some consideration for the human element, jeez.
Documentation, specifically _mathematical_ documentation, is more fault tolerant than either pseudocode or actual code.
At any other time, I would agree with you, but where archivism is concerned, I do not.
Are you saying it would be easier to implement xz from mathematical documentation than from a computer program? I don't think so. I have tried (multiple times) to implement algorithms from the "mathematical documentation" in academic papers, and it usually goes very badly; there are always missing parts. If I had a choice, I'd take ALGOL-58 over a human-language description any time.
>> What you're saying could easily blow up from 'how to code C' to 'reading the GCC / Clang compiler source code to figure out how a specific UB was implemented, which the program in this specific case falls into', which I'm sure nobody wants to spend their weekend doing
There will be many, many people that will gladly dig into the minutia and technical details of arcane hardware, especially when it means making progress towards filling in the historical record. This is already the case today, there is a working https://en.wikipedia.org/wiki/Colossus_computer reconstructed just because it was historically significant.
There are languages which achieve critical mass and stay, and languages which don't, and disappear.
RPG is still around, and IBM still sells it on their cloud. But the language is highly proprietary, so don't expect cheap access to it.
ALGOL-58 is one of the languages which died; but ALGOL-68 is in the current Debian repos, and would take under 30 seconds to install.
FLOW-MATIC has died, but COBOL is around and again, easily installable.
I think you are underestimating how much legacy software there is. For example, Fortran 77 is still actively used, and new programs are written in it every day. There is an immense amount of code written in C89. Support for those languages is likely to stay forever.
In general, I think this topic is very interesting. Imagine 1000 years have passed, and all the computers are running the YEAR3000 architecture, which is incompatible with all the software we have today. Archeologists discover a treasure trove of texts and binary files from the 21st-century internet. They know ASCII and English, but nothing else. What can they do?
(1) Manually port dmitry.gr's emulator code to whatever language you are using now. This should be doable -- the software is about 6000 lines of very straightforward C89. It does not use any OS services, nor does it rely on UB or complex language features.
(2) Use it to boot Linux (the image is included on that webpage). This allows you to run Ubuntu from 2009 on your YEAR3000 architecture.
(3) If your archive contains a repository snapshot from 2009, copy it to your machine. You can now install and run all that old software on your YEAR3000 computers. Congrats!
(4) The only thing missing is graphics support. Just run x11vnc (included in the Jaunty repo) over a serial port (included in dmitry.gr's emulator). The VNC protocol is simple and well specified.
... and that's how I'd bootstrap 20th century computing on 30th century infrastructure. Sure, it will take some effort -- but this only needs to be done once, and running programs will be easy from there on.
I don't disagree with that at all! I did try to say (maybe unclearly) that having simpler, more foolproof and failsafe layers at every level seems absolutely worth pursuing anyway, where possible. But I also wonder whether some of the common wisdom is from an age that is obsolete for many scenarios? I.e., in the 80s and 90s and even early 00s there was a lot more churn, practices were less standardized, computing time was more expensive, storage capacity was far more expensive, the ratio of software to data size was higher, etc. The latter seems to tie into "must assume the original executable is either not available or not able to be executed." For serious archiving, does it not make sense to bundle in not merely the executable, but in some instances an entire environment? In a "fall of civilization" type scenario that may not be helpful vs a clear simple spec and ease of bootstrapping, but in situations where technological continuity is a limiting factor anyway, is it safe to simply assume basics like "x86 will never go away, at least as something that is virtualized"? In my own experience there is a pretty clear cutoff date after which I can continue to run the entire environment in a VM.
Again, this is shooting the breeze a bit, and the article is discussing a case where there should be the freedom to choose better formats. But for a lot of important archive material, including software itself, are we getting to the point where many long-term archives should simply include everything necessary to deal with them in the present day as a container or VM image, which is then stored with a solid amount of parity and replication?
> many long-term archives should simply include everything necessary to deal with them in the present day as a container or VM image
Unfortunately, any such image presumes you have access to the hardware, or has low-level instruction sets/processor design baked in. Think how many PDP-11s are around today. And in terms of an archive, it's only been 50 years since the PDP-11 was introduced. That's a blink of an eye by archival standards.
Why does it matter how many physical machines are alive? There are tons of emulators around. There is even one in JavaScript, with the ability to load disk images as well.
The hard part is hardware - the drives go bad, the computers fail. But disks grow, and it is getting simpler and cheaper to store lots of data. As long as you keep copying the files to modern media every 10 years or so, you should no longer have any data loss.
(The only exception is proprietary data formats which cannot be opened except by the original program, which cannot easily be run in a VM. Those should be avoided at all costs.)