Xz format inadequate for long-term archiving (2016)

jordigh · on June 5, 2019

This guy used to go around GNU mailing lists (and others) trying to get us to use lzip.

https://gcc.gnu.org/ml/gcc/2017-06/msg00044.html

https://lists.debian.org/debian-devel/2017/06/msg00433.html

It was a bit bizarre when he hit the Octave mailing list.

Eventually, people just wanted xz back:

http://octave.1599824.n4.nabble.com/opinion-bring-back-Octav...

AdmiralAsshat · on June 5, 2019

Everyone's gotta have their white whale, I guess.

brianpgordon · on June 5, 2019

Previous discussions:

https://news.ycombinator.com/item?id=12768425

https://news.ycombinator.com/item?id=16884832

esaym · on June 5, 2019

Interestingly, since "recovery" is mentioned several times, I decided to test myself.

I took a copy of a jpeg image, compressed it different times with either gzip or bzip2, then with a hexeditor modified one byte.

The recovery instructions for gzip is to simply do "zcat corrupt_file.gz > corrupt_file". While for bzip2 is to use the bzip2recover command which just dumps the blocks out individually (corrupt ones and all).

Uncompressing the corrupt gzip jpeg file via zcat at all times resulted in an image file the same size as the original and could be opened with any image viewer although the colors were clearly off.

I never could recover the image compressed with bzip2. Trying to extract all the recovered blocks made by bzip2recover via bzcat would just choke on the single corrupted block. And the smallest you can make a block is 100K (vs 32K for gzip?). Obviously pulling 100K out of a jpeg will not work.

Though I'm still confused as to how the corrupted gzip file extracted to a file of the same size as the original. I guess gzip writes out the corrupted data as well instead of choking on it? I guess gzip is the winner here. Having a file with a corrupted byte is much better than having a file with 100K of data missing...

gmueckl · on June 5, 2019

Your method is clearly flawed. Altering a single byte once is insufficient as a test unless you analyzed the structure of the compressed file first to see where the really important information is stored. It may well be that you just modified a verbatim string from the source data in the gzip case, but corrupted a bit of metadata about how the compressed data is structured in the bzip2 case. If you tried a different random bytes, the results might be reversed.

The proper test would be to iterate over every bit in the compressed file, flip it and try to recover. Then compute number of successful recoveries against the number of bits tested. Compression algorithms that perform similarly should gmhave similar likelyhoods that a single bit flip corrupts the entirety of the data.

esaym · on June 5, 2019

I thought about that as well. I tried it three different times all with the same results.

dual_basis · on June 5, 2019

Three? Well then, case closed!

fao_ · on June 6, 2019

Did the poster imply that their test was the be-all and end-all of error tolerance in common-use compressions systems? No. Then why did you assume that they did say that, and then write such a useless comment

wereHamster · on June 5, 2019

Whether recovery leads to (almost) useable data depends on what byte you modify. It's entirely possible that a single corrupt byte in the compressed data leads to a single corrupt byte when uncompressed. When you are dealing with images you may not even notice that a single pixel is wrong. But it's also possible that you completely destroy the data such that the decompression algorithm can't even deal with it and has to give up.

chasil · on June 5, 2019

A decade and a half ago, I wrote an Oracle archived log that I had compressed with bzip2 to a DLT40 tape.

I recovered and uncompressed (without error) the log, then tried to apply it to a database recovery which rejected it as corrupt.

After several attempts to read the tape (amounting to dozens of hours), I finally put it in the original drive that wrote it and pulled the file to the remote recovery system - this worked.

I immediately began including PAR2 files on the tapes, so the restored contents could be verified and corrected.

I have my doubts that bzip2 is as sensitive to corruption as the author of asserts, but perhaps there have been improvements to the code since my misfortune.

xoa · on June 5, 2019

Not that many of the complaints aren't reasonable, but I thought that in general compression/format was orthogonal to parity, which is what I assume is actually wanted for long-term archiving? I always figured that the goal should normally to be able to get back out a bit-perfect copy of whatever went in, using something like Parchive at the file level or ZFS for online storage at the fs level. I guess on the principle of layers and graceful failure modes it's better if even sub-archives can handle some level of corruption without total failure, and from a long term perspective of implementation independence simpler/better specified is preferable, but that still doesn't seem to substitute for just having enough parity built in to both notice corruption and fully recover from it to fairly extreme levels.

matt-attack · on June 5, 2019

I think with archiving it’s more than that. Sure you can guarantee that the actual tool you just compressed with can restore the original perfectly. But with long term digital archiving I think you need the assurance that the “spec” called “xz” could be perfectly reimplemented by an expert in the future. Based solely on documentation. And on a platform that doesn’t exist today. That is, you must assume the original executable is either not available or not able to be executed.

ltbarcly3 · on June 5, 2019

Why would they need to recreate it solely based on 'documentation'? It is open source, the source code is the documentation. It seems just as likely that the source would survive as it is likely that some complete technical documentation would survive. Maybe they wouldn't be able to compile it (probably they would be able to compile it, I don't see why they wouldn't have some kind of computer emulator available), but it's better than any other kind of documentation you could provide.

The src/ tree of xz is 335k (compressed with gzip). If you are worried future digital historians won't be able to figure out the xz format, throw a copy of the gzip'd source onto every drive you store archives on, it would basically be free and would almost guarantee they would have a complete copy of exactly what they would need to decompress the files.

matt-attack · on June 5, 2019

You're exhibiting shortsightedness when it comes to "source". If I give you some RPG [1] or maybe some ALGO 58 [2] source code are you going to just compile and run it no problem? How about some FLOW-MATIC [3]?

Point being that computer languages come and go.

[1] https://en.wikipedia.org/wiki/IBM_RPG

[2] https://en.wikipedia.org/wiki/ALGOL_58

[3] https://en.wikipedia.org/wiki/FLOW-MATIC

ltbarcly3 · on June 5, 2019

Yes, programming languages come and go, but I don't see how that matters. Some future historian will either have access to a working copy of xz or they will not. If they don't, and they want to implement it, having a copy of the source code is far better than anything else you could give them. Sure, future programming languages will be quite different, but humans will certainly be able to read and understand C code. If humanity has forgotten how to read C code (and lost all knowledge of it), how are they going to read this documentation you seem to prefer? Human languages come and go also..

https://en.wikipedia.org/wiki/Egyptian_hieroglyphs

https://en.wikipedia.org/wiki/Judaeo-Aragonese

https://en.wikipedia.org/wiki/Latin

Any argument you can make about historians being able to recover dead languages you can make the exact same argument for their ability to recover dead computer languages, and there is no better or more accurate specification than the actual code.

So let me add to my recommendation, in addition to a copy of the xz source code, include a plain text copy of any 'how to program in C' book, or just the wikipedia page for the C language. That is more than enough for them to construct a program that can decompress xz files, once they relearn how to read whatever long dead language the book is written in (Ancient Pre-Cataclysm Earth English for example).

fao_ · on June 6, 2019

> If humanity has forgotten how to read C code (and lost all knowledge of it), how are they going to read this documentation you seem to prefer?

Sure but are they going to remember something like, weird precedence rules (See: &), undefined behaviour, etc. Just because they want to reimplement a specific, small, program does not mean they want to relearn several languages. What you're saying could easily blow up from 'how to code C' to 'reading the GCC / Clang compiler source code to figure out how a specific UB was implemented, which the program in this specific case falls into', which I'm sure nobody wants to spend their weekend doing, implementing something like `xz` could simply be a midpoint in their destination, they don't want to spend weeks digging up COBOL. Have at least some consideration for the human element, jeez.

Documentation, specifically _mathematical_ documentation, is more fault tolerant than either psuedocode or actual code.

At any other time, I would agree with you, but where archivism is concerned, I do not.

theamk · on June 6, 2019

Are you saying it would be easier to implement xz from mathematical documentation than from computer program? I don't think so. I tried (multiple times) to implement algorithms from "mathematical documentation" in academic papers, and it is usually very bad, there are always missing parts. If I had a choice, I'd choose ALGOL-58 over human-language description anytime.

ltbarcly3 · on June 6, 2019

>> What you're saying could easily blow up from 'how to code C' to 'reading the GCC / Clang compiler source code to figure out how a specific UB was implemented, which the program in this specific case falls into', which I'm sure nobody wants to spend their weekend doing

There will be many, many people that will gladly dig into the minutia and technical details of arcane hardware, especially when it means making progress towards filling in the historical record. This is already the case today, there is a working https://en.wikipedia.org/wiki/Colossus_computer reconstructed just because it was historically significant.

fao_ · on June 6, 2019

I think you missed the implication that I didn't state explicitly, but figured was pretty clear:

> which I'm sure nobody wants to spend their weekend doing [if their original goal was to simply reconstruct xz].

theamk · on June 6, 2019

There are languages which achieve critical mass and stay, and languages which don't, and disappear.

RPG is still around, and IBM still sells it on their cloud. But the language is highly proprietary, so don't expect a cheap access to it.

ALGOL-58 is one of the languages which died; but ALGOL-68 is in the current debian repos, and would take under 30 seconds to install.

FLOW-MATIC has died, but COBOL is around and again, easily installable.

I think you are underestimating how much legacy software there is. For example, Fortran 77 is still actively used, and there are programs written in it every day. There is immense amount of programs written in C89. The support for those languages is likely to stay forever.

In general, I think this topic is very interesting. Imagine 1000 years have passed, and all the computers are running YEAR3000 architecture which is incompatible with all the software we have today. Archeologists discover a treasure trove of texts and binary files from 21th century internet. They know ASCII and English, but nothing else. What can they do?

The answer is surprisingly simple:

(1) Write an emulator for an simple CPU, like an ARMv5. Here is a good one: https://dmitry.gr/?r=05.Projects&proj=07.%20Linux%20on%208bi...

You'd need to manually port this code to whatever language you are using now. But this should be doable -- the software has 6000 lines of very straightforward C89 code. It does not use any OS services, nor does it rely on UB or complex language features.

(2) Use it to boot Linux (the image is included in that webpage). This allows you to run Ubuntu from 2009 on your YEAR3000 architecture.

(3) If your archive contains repository snapshot from 2009 to your machine. You can now install and run all the 20th century software on your YEAR3000 computers. Congrats!

(4) The only thing missing is graphics support. Just run x11vnc (included in the Jaunty repo) over serial port (included in dmitry.gr's emulator). VNC protocol is simple and well specified.

... and that's how I'd bootstrap 20th century computing on 30th century infrastructure. Sure, it will take some effort, -- but this only needs to be done once, and running programs will be easy from there on.

xoa · on June 5, 2019

I don't disagree with that at all! I did try to say (maybe unclearly) that having simpler more foolproof and failsafe layers at every level seems absolutely worth pursuing anyway, where possible. But I also wonder whether some of the common wisdom is from an age that is obsolete for many scenarios? Ie., in the 80s and 90s and even early 00s there was a lot more churn, practices were less standardized, computing time was more expensive, storage capacity was far more expensive, the ratio of software to data size was higher, etc. The latter seems to tie into "must assume the original executable is either not available or not able to be executed." For serious archiving, does it no make sense to just bundle in not merely the executable, but in some instances an entire environment? In a "fall of civilization" type scenario that may not be helpful vs a clear simple spec and ease of bootstrapping, but for situations where technological continuity is a limiting factor for some reason anyway is it safe to simply assume basics like at this point x86 will never go away as something that is at least virtualized? In my own experience there is a pretty clear cutoff date after which I can continue to run the entire environment in a VM.

Again this is shooting the breeze a bit, article is discussing a case where there should be the freedom to choose better formats. But for a lot of important archive material, including software itself, are we getting to the point where many long term archives should simply including everything necessary to deal with them in the present day as a container or VM image, which is then stored with a solid amount of parity and replication?

matt-attack · on June 6, 2019

> many long term archives should simply including everything necessary to deal with them in the present day as a container or VM image

Unfortunately, any such image would presume you have access to the hardware, or it has low-level instruction sets/processor design baked in. Think how many PDP-11's are around today. And in terms of an archive it's only been 50 years since the PDP-11 was invented. That's a blink of an eye in terms of archival standards.

theamk · on June 6, 2019

> Think how many PDP-11's are around today.

Why does it matter how many physical machines are alive? There are tons of emulators around. There is even one in Javascript, with an ability to load disk images as well.

The hard part is hardware - the drives go bad, the computers fail. But disks grow, and it is getting simpler and cheaper to store lots of data. As long as you keep copying the files to modern media every 10 years or so, you should no longer have anybdata loss.

(The only exception is proprietary data formats which cannot be opened except by original program which cannot be run in VM easily. Those should be avoided at all costs)

Adamantcheese · on June 5, 2019

How about something like ZPAQ instead for archiving? Especially if you're doing backups and not a lot of the information is changing.

ltbarcly3 · on June 5, 2019

No file format is perfect, I've been using xz for years and I can't think of a single issue I have had. The compression rate is dramatically better than gzip or bzip2 for many types of archives (especially when there is a large redundancy, for example when compressing spidered web pages from the same site you can get well over 99% size reduction compared to 70% reduction for gzip, which means using less than one 30th of the disk space).

Lately I have been using zstd for some things since it gives good compression and is much faster than xz.

This criticism of xz just seems nit picky and impractical, especially if you are compressing tar archives and/or storing the archives on some kind of raid which can correct some read errors (such as raid5).

asveikau · on June 5, 2019

I remember seeing this article before. This time the reaction that surges for me is: if you want long-term archiving but don't assume redundant storage, it's not going to go well. Put your long-term archives on ZFS.

stavros · on June 5, 2019

Why are you assuming they aren't assuming redundant storage? Redundant storage isn't a cure-all, there's still a chance two blocks on two disks will fail in the exact same spot.

asveikau · on June 5, 2019

Seems easier to increase the amount of disks and address it at a low layer than to re-engineer all layers, all file formats, for corruption.

mixmastamyk · on June 5, 2019

If there are safer free formats, why not use them? It's not like all data everywhere is always going to be stored on zfs.

gwern · on June 5, 2019

By the 'end to end principle', redundancy should probably be concentrated somewhere in the stack, and the rest of the stack should be concerned merely with validating integrity. It's unlikely that the optimum balance of resources and loss probability will entail redundancy at every level of the stack, from raw HDD bytes up to the global system level.

mixmastamyk · on June 5, 2019

Imagine files being moved from tape to disk to optical disc to NAS over the years. What now?

https://news.ycombinator.com/item?id=16886607

gwern · on June 5, 2019

You move them, then you verify the integrity of the new copy, then you can get rid of the old one. You don't need to build integrity checks and extra FEC at every level of the system redundantly.

Just like in the end-to-end principle when applied to networking: you have a single strong integrity check at the very furthest endpoint possible, and then you don't build in integrity & ECC at every level of the stack, you devote those resources to higher performance, and just do retransmission from the other endpoint when a file occasionally gets corrupted and the integrity check catches it.

mixmastamyk · on June 6, 2019

Appealing to "more manual work" is not very compelling to be honest.

gwern · on June 7, 2019

-_- I never appealed to 'more manual work', nor did I say that. If you refuse to understand my point and want to make up things I did not say, so be it.

mixmastamyk · on June 9, 2019

> you verify the integrity

gwern · on June 13, 2019

Don't be a moron. You knew perfectly well what I meant. (Did I also mean that 'you', a human, should be checking hashsums and FEC by hand for every network packet...?)

kipari · on June 5, 2019

I reckon that the chance of the same two blocks on two different disks failing between ZFS scrubs would be incredibly small.

rincebrain · on June 5, 2019

Assuming the corruption is independent, potentially, but A) even unlikely events are likely to happen for large enough N, and much more importantly, B) as another poster described, if you don't regularly check the integrity, and you have single-disk redundancy, losing a whole disk can likely result in you discovering a block that got mangled some time ago, too late to do anything about it.

There are a number of cases where failures might not be independent, though.

What if, say, you're using multiple drives of the same model, which have a firmware bug causing them to sometimes mangle data on the Nth sector?

What if you're using multiple drives from the same manufacturing batch which have a flaw leading to certain regions being more likely to fail than others?

What if you're using some battery-backed write cache under ZFS (from a HW RAID card or something more exotic), and it helpfully writes out garbage to the same sector on two disks?

What if you have a certain manufacturer's hard drives that lie about flushing their write cache successfully to disk if you issue a SMART request to them between when they put data in cache and when it actually gets to disk, so polling those two disks when they both just got a write results in data loss?

(The last of these is a real firmware bug I ran into - I was running a testbed of a bunch of raidz3 vdevs, and spent some time isolating when zpool scrub kept making the error counters increase even though it had corrected them all...thanks, Samsung HD204UI drives.)

cmurf · on June 5, 2019

It is incredibly small if you don't consider either drive failing. But if one drive fails, it happens with some regularity that a sector on the good drive is bad. In actuality, only one sector is bad, but in effect the dead drive means its mirror is also bad.

This comes up on the linux raid list with some frequency whenever there are drive failures with raid56, and the subsequently the raid trips over a single bad sector.

But it's true that lack of scrubbing contributes to this scenario, as well as the terrible combination of consumer drives with very high bad sector recovery times and the Linux SCSI command timer default of 30 seconds. That combination ends up causing a masking of bad sectors that end up not getting repaired, and as a user you may not realize that the link resets are not normal and suggest a bad sector as the cause.

zaphirplane · on June 5, 2019

Are you saying that a failure happens which isn’t detected and when the 2nd failure occurs we notice because the data is inaccessible?

Which raid s/w does this ?

cmurf · on June 6, 2019

Correct. All that depend on the SCSI block layer, which includes libata and thus common consumer SATA drives. A NAS or better drive will come out of the box with short error time outs, typically 70 deciseconds, and quickly issue a read error with the LBA of the offending bad sector, and the RAID can then know to obtain a copy or reconstruct from parity, write the good data to the bad sector thus fixing it. Either the write works, or if it fails the drive firmware is responsible for remapping that LBA to a reserve physical sector.

In the case where the drive error timeout is longer than the SCSI block layer, it just results in a link reset. The actual problem with the drive is obscured by the reset, including the bad sector, so it never gets repaired.

Btrfs, mdadm, lvm are affected and I'm pretty sure ZFS on Linux as well assuming they haven't totally reimplemented their own block layer outside of the SCSI subsystem.

It's a super irritating problem, the kernel developers know all about it, but thus far it's considered something distributions should change for the use cases that need it. And what that means so far is distros don't change it and users using consumer drives with high error recovery times, get bitten.

https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

zaphirplane · on June 6, 2019

The link you posted talks about the raid software kicking a whole disk out of the raid array when the disk takes too long to respond (basically but not exactly) due to 2 timeout variables mismatch

The post I was responding to implied a raid array could be degraded and you wouldn’t know till it completely failed

Interesting nevertheless

stavros · on June 5, 2019

Yes, over normal timescales. A lot can happen in a thousand years.

asveikau · on June 5, 2019

Thousands of years is a lot of scrubs and a lot of disk replacements, though. And a solution like ZFS, properly monitored, should help make those detections and repairs happen early, with lower odds of loss.

Although honestly in a thousand year timeframe I very much doubt humanity will preserve ZFS, gzip, tar, jpeg, PNG, ASCII, today's spoken and written languages in current form, etc. Just as written material from 1000 years ago is not very accessible to most people; with the original material you need intense study before you even know what you're looking at.

mkj · on June 6, 2019

A bit of speculation here, but perhaps xz won over lzip because it has a real manpage?

lzip has the usual infuriating short summary of options with a "run info lzip for the complete manual". Also the source code repository doesn't even seem linked directly from the lzip homepage - technical considerations aren't the only thing that determines if software is "better", it also has to be well presented.

shmerl · on June 5, 2019

xz-utils should implement parallel decompression already. pixz is doing it, but stock xz is not. Most end users benefit from faster decompression.

SEJeff · on June 5, 2019

This should have (2016) in the title.

Thoreandan · on June 5, 2019

and "from the author of lzip, a competing lzma library that never went viral".

Welcome to the Better Technology that Shoulda Made It bench. Your seat's over there next to OS/2, BeOS, and OpenGenera.

nkoren · on June 5, 2019

Amiga forever!!!!!!

zmix · on June 5, 2019

Ha! I just wanted to add this. But you did it first! :-)

Bender · on June 5, 2019

If you first use tar to preserve xattrs/etc.. then you can use anything to compress. xz, bz2, 7z, even arj if you are feeling nostalgic.

    tar cvfJ ./files.tar.xz /some/dir

zamalek · on June 5, 2019

You've missed the point of the article entirely. A single bit-flip (which is almost guaranteed over long-term) can easily render the entire xz file corrupt.

This has nothing to do with xattrs/etc.

Bender · on June 5, 2019

Yes, I am totally on auto-pilot today. I'm used to a different article that gets re-posted often about xz and my browser blocks non-https sites so I assumed it was that other article.

That said, I use xz in automation that compresses files on one end and decompresses on the other. I've not had any file corruption thus far. checksums always match. Hopefully the author has submitted bug reports and ways to reproduce.

imiric · on June 5, 2019

I've been using tar+pixz+par2 for backups for a while now, but this article still worries me.

What can I replace pixz with that compresses as well and keeps the indexing functionality? I'd like to avoid zstd because Facebook.

microcolonel · on June 5, 2019

> "3 Then, why some free software projects use xz?"

Because the files are usually smaller than gzip, with faster decompression than bzip2, and the library is available on most systems.

h1d · on June 5, 2019

Archiving for distribution and backups are very different things. You don't care if some app distribution compressed file gets corrupted, you just compress again but your compressed backup files usually don't have much source of reference.

I wouldn't use any unreliable format for backups. I picked bzip2 for stability and compression rate.

microcolonel · on June 5, 2019

In my opinion, the compressor is not the right place to add data integrity mechanisms, especially since data integrity mechanisms only really apply to particular media. Data on hard drives don't get corrupted in the same way as data on TLC SSDs, and generally on the latter you're better off with redundancy and diversification, than with inline error correcting codes.

Honestly, I don't see why xz should have any of its own data integrity mechanisms whatsoever, except maybe a whole-archive CRC32 or similar.

duskwuff · on June 5, 2019

Right. The purpose of an archiver/compressor is to store a bunch of files together, and use as little space as possible to do it. Data integrity / error correction / redundancy all lie in the opposite direction of that goal.