Xz format inadequate for long-term archiving (2016)

londons_explore · on March 29, 2024

I think none of these issues really matter.

Sure, it isn't perfect. But all of those issues are livable. Most of them involve handling corrupt data, but any serious archive will hash the whole file as part of the cataloging process, so corruption gets caught regardless of what the compressed format does.
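
As a rough illustration of that whole-file hashing step, here is a minimal Python sketch using SHA-256 from the standard library. The function name `stream_sha256` and the chunk size are illustrative choices, not something any particular archive is known to use.

```python
import hashlib
import io

def stream_sha256(fileobj, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a binary file object, read in chunks."""
    digest = hashlib.sha256()
    # Read fixed-size chunks so that large archives need not fit in memory.
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

# Usage with an in-memory stand-in for an archive file:
print(stream_sha256(io.BytesIO(b"abc")))
```

For a real file you would pass `open(path, "rb")` instead of the `io.BytesIO` object and store the digest in the catalog next to the file name.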

The commonly-used technology is by far the best choice for long term archiving, because something that has a billion users will go obsolete/unreadable a long time after a fancy compressor written by a phd student in your lab.

If I were running an archive today, I would be keeping everything in .zip files, because there is a ~35 year window of computers that can open them, and I wouldn't be surprised if the format remains in common use for a further 35 years. That means that in the year 3000, someone wanting to access the data only needs to find/emulate a system from anytime between 1990 and 2060 to have a good chance of reading the data.

SrslyJosh · on March 29, 2024

> The commonly-used technology is by far the best choice for long term archiving, because something that has a billion users will go obsolete/unreadable a long time after a fancy compressor written by a phd student in your lab.

Did you read the article? `xz` is far more complex than the alternatives. Your analogy doesn't make sense in this scenario.

> If I were running an archive today, I would be keeping everything in .zip files

So...not xz. Okay.

londons_explore · on March 29, 2024

Complexity doesn't really matter... You aren't going to be reading these with a hex editor. What matters is that decompression software is widely available today and will be long into the future. While that's kinda true for .xz, it's far, far more true for .zip.

magicalhippo · on March 29, 2024

From 2016, last updated in 2022, so way before the current xz backdoor debacle[1].

[1]: https://news.ycombinator.com/item?id=39865810

comex · on March 29, 2024

This is an old article. Here’s a comment I wrote back in 2018 explaining one major problem with it:

https://news.ycombinator.com/item?id=16889222

SrslyJosh · on March 29, 2024

I don't think that a problem with lzip invalidates all of the article's criticisms of xz.

lifthrasiir · on March 30, 2024

But at least you can see why you shouldn't use lzip in addition to xz, which has been my position for a long time.

lifthrasiir · on March 30, 2024

I can see why the lzip author is frustrated with the xz format, especially given that there are so many checksums and paddings around, but the lzip format is the opposite extreme.

7-zip already had a concept of multiple filters which contribute to its efficiency, and the underlying design of xz does capture them without much complication. For example, filters in the original 7-zip format (or "codecs") can have both multiple input and output streams [1]. This makes less sense for a single file compressor and xz carefully avoided them. The main problem with the xz format is not its concept but more about its concrete implementation: you don't need extensibility, you only need agility.

In comparison, lzip is too minimal. It might be technically agile thanks to its version field, but that doesn't count for much if you do nothing with it and merely claim to be open to any addition. It is not hard to pick some filters and mandate only the most useful combinations. The stream could have been periodically interrupted to give an early chance to detect errors before the member footer. (Unless lzip natively produces a multimember file even for a single input, which is AFAIK not the case.) The lzip author claims that corruption in the compressed data can be detected during decompression, but that would imply too much redundancy in the compressed data, so this claim is clearly misguided. And what the heck is that dictionary size coding? Compressed formats frequently use exponent-mantissa encodings, but I have never seen one where the mantissa is subtracted.

Of course, both should be avoided at this point because zstd is fast and efficient enough. Also, the file format for zstd is better than both in my opinion.

[1] https://py7zr.readthedocs.io/en/latest/archive_format.html#c...

lifthrasiir · on March 30, 2024

For the record, the following is a concise summary of the .xz file format:

  Everything is little endian.

  The file is a concatenated list of streams.

  vu63 is a variable-length encoding for u63:
    u8[1..9] where all but last byte has an MSB set.
    Remaining 7-bit blocks are interpreted in little endian.
    No overlong representation allowed, so `00` is valid but `80 00` isn't.
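
A minimal Python sketch of a vu63 decoder under the rules just listed. The function name `decode_vu63` and its `(value, next_pos)` return shape are hypothetical choices for illustration, not taken from any xz implementation.

```python
def decode_vu63(buf, pos=0):
    """Decode one vu63 from `buf` starting at `pos`; return (value, next_pos)."""
    value = 0
    for i in range(9):  # at most 9 bytes, 7 payload bits each = 63 bits
        if pos + i >= len(buf):
            raise ValueError("truncated vu63")
        byte = buf[pos + i]
        # 7-bit groups are stored least significant first.
        value |= (byte & 0x7F) << (7 * i)
        if byte & 0x80 == 0:
            # A final zero byte is only allowed when it is the sole byte,
            # which rejects overlong forms such as `80 00`.
            if byte == 0 and i > 0:
                raise ValueError("overlong vu63")
            return value, pos + i + 1
    raise ValueError("vu63 longer than 9 bytes")

print(decode_vu63(b"\x80\x01"))
```

The second element of the result lets a caller keep reading the next field from the same buffer.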

  Stream format is:
    u8[6] magic bytes, `FD 37 7A 58 5A 00`.
    u8[*] stream header:
      u16 stream flags:
        Bits 8..11 are check types and sizes:
          0 = No check, 1 = CRC32, 4 = CRC64, 10 = SHA-256.
          Other values are reserved but their lengths are guaranteed to be
            4 bytes for 1..3, 8 bytes for 4..6, ..., 64 bytes for 13..15.
        Other bits are reserved and should be zero.
    u32 CRC32 checksum for the stream header.
    u8[*] zero or more blocks.
    u8[*] single index, distinguished from the block by the first byte.
    u32 CRC32 checksum for the following stream footer.
    u8[*] stream footer:
      u32 backward size, equal to (index size / 4 - 1).
      u16 a copy of stream flags.
    u8[2] magic bytes, `59 5A`.
    u8[0..3] padding bytes to make the stream size a multiple of 4 bytes.
    Each stream, before and after compression, should be at most 2^63-1 bytes long.
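
The stream header and footer described above can be checked with a few lines of Python. This is a sketch, not a full parser: the function names are hypothetical, and the footer helper assumes the buffer ends exactly at the `59 5A` magic with no trailing padding.

```python
import struct
import zlib

XZ_MAGIC = bytes.fromhex("fd377a585a00")

def check_size(check_type):
    """Size in bytes of the check field for a 4-bit check type."""
    if check_type == 0:
        return 0
    # 1..3 -> 4, 4..6 -> 8, 7..9 -> 16, 10..12 -> 32, 13..15 -> 64
    return 4 << ((check_type - 1) // 3)

def parse_stream_header(buf):
    """Parse the 12-byte stream header; return (check_type, check_size)."""
    if len(buf) < 12 or buf[:6] != XZ_MAGIC:
        raise ValueError("not an xz stream")
    flags, crc = struct.unpack_from("<HI", buf, 6)
    if zlib.crc32(buf[6:8]) != crc:
        raise ValueError("stream header CRC32 mismatch")
    if flags & ~0x0F00:
        raise ValueError("reserved stream flag bits set")
    check_type = (flags >> 8) & 0xF
    return check_type, check_size(check_type)

def parse_stream_footer(buf):
    """Parse the last 12 bytes of `buf`; return (index_size, check_type)."""
    tail = buf[-12:]
    if len(tail) < 12 or tail[10:12] != b"YZ":
        raise ValueError("missing footer magic")
    crc, backward, flags = struct.unpack_from("<IIH", tail, 0)
    # The CRC32 covers the backward size and the copy of the stream flags.
    if zlib.crc32(tail[4:10]) != crc:
        raise ValueError("stream footer CRC32 mismatch")
    return (backward + 1) * 4, (flags >> 8) & 0xF

# The usual first 12 bytes of a .xz file that uses CRC64 checks:
print(parse_stream_header(bytes.fromhex("fd377a585a000004e6d6b446")))
```

The index size recovered from the footer is what lets a reader seek backwards from the end of the stream to the index without decoding any block.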

  Block format is:
    u8[*] block header:
      u8 block header size / 4 - 1, shouldn't be zero.
      u8 block flags:
        Bits 0..1 are the number of filters - 1.
        Bit 6 is set if the compressed size is present.
        Bit 7 is set if the uncompressed size is present.
        Other bits are reserved and should be zero.
      Optional vu63 compressed size of this block, shouldn't be zero.
      Optional vu63 uncompressed size of this block.
      For each filter:
        vu63 filter ID.
        vu63 size of filter properties.
        u8[*] filter properties depending on the filter.
        Each filter is applied in this order to the uncompressed data.
      u8[*] padding bytes to make the block header size a multiple of 4 bytes.
        Padding may require >3 bytes, since the minimum block header size is 8 bytes.
      u32 CRC32 checksum for the block header except for this field.
    u8[*] compressed data.
    u8[0..3] padding bytes to make the block size a multiple of 4 bytes.
    u8[*] check bytes, derived from the stream flags and uncompressed block bytes.

  Index format is:
    u8 magic byte, `00`, which distinguishes indices from blocks.
    vu63 number of blocks in this stream.
    For each block:
      vu63 unpadded size, equal to the block size minus the number of block padding bytes.
        This does NOT exclude other padding bytes in the block header, for example.
      vu63 uncompressed size.
    u8[0..3] padding bytes to make the index size a multiple of 4 bytes.
    u32 CRC32 checksum for the index except for this field.
    Each index should be at most 2^34 bytes long.

  Filter 0x21 indicates LZMA2 and can only be the last filter.
  Its filter properties are:
    u8 flag:
      Bits 0..5 are encoded dictionary size K, where the true size is:
        (2 + K % 2) * 2^(11 + floor(K / 2)) bytes for K < 40. (4 KiB through 3072 MiB)
        4096 MiB - 1 byte for K = 40.
        Other values are reserved.
      Other bits are reserved and should be zero.
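
The LZMA2 dictionary size formula above is easy to get wrong by one exponent, so here is a small Python sketch of it. The function name `lzma2_dict_size` is made up for this example.

```python
def lzma2_dict_size(flag):
    """Decode the LZMA2 filter property byte into a dictionary size in bytes."""
    if flag & 0xC0:
        raise ValueError("reserved bits set")
    k = flag & 0x3F
    if k < 40:
        # Mantissa is 2 or 3, exponent grows by one every two codes.
        return (2 + k % 2) << (11 + k // 2)
    if k == 40:
        return (1 << 32) - 1
    raise ValueError("reserved dictionary size code")

for k in (0, 1, 22, 39, 40):
    print(k, lzma2_dict_size(k))
```

Codes 0 and 39 give the 4 KiB and 3072 MiB endpoints quoted in the summary, and code 22 gives 8 MiB.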

  Filters 0x04..0x09 indicate a BCJ filter for x86, PowerPC, IA-64, ARM, ARM Thumb, SPARC
    respectively, and cannot be the last filter.
  Their filter properties are:
    Optional u32 start offset for this input, zero if omitted.

  Filter 0x03 indicates the Delta filter and cannot be the last filter.
  Its filter properties are:
    u8 delta distance - 1.

In comparison, the following is a concise summary of the .lz file format:

  Everything is little endian.

  The file is a concatenated list of members.

  Member format is:
    u8[4] magic bytes, `4C 5A 49 50`.
    u8 version, currently 1.
    u8 coded dictionary size K, where the true size is (16 - floor(K / 32)) * 2^(K % 32 - 4) bytes.
      K % 32 is the base-2 logarithm of the base size and should be in 12..29.
      The true size should range from 4 KiB (K = 0x0C) to 512 MiB (K = 0x1D).
      Note that this mapping is not increasing, so the valid range for K is not consecutive.
    u8[*] LZMA stream, which uses a particular version of LZMA and a custom marker at the end.
    u32 CRC32 checksum for the original uncompressed data.
    u64 uncompressed size.
    u64 member size.
    Each member should be at most 2^51 bytes long.
    Each member, after decompression, should be at most 2^64 - 1 bytes long.
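
To make the subtracted-mantissa encoding concrete, here is a Python sketch of the lzip dictionary size byte. It follows the lzip manual's description as I understand it: the low five bits hold the base-2 logarithm of a base size (12 through 29) and the top three bits count how many sixteenths of that base are subtracted. The function name `lzip_dict_size` is hypothetical.

```python
def lzip_dict_size(k):
    """Decode the lzip coded dictionary size byte into a size in bytes."""
    log2_base = k & 0x1F   # bits 0..4: log2 of the base size
    numerator = k >> 5     # bits 5..7: sixteenths of the base to subtract
    if not 12 <= log2_base <= 29:
        raise ValueError("base size out of range")
    size = (16 - numerator) << (log2_base - 4)
    if size < 4096:
        raise ValueError("dictionary size below 4 KiB")
    return size

# The manual's worked example: 0xD3 is 512 KiB minus 6 * 32 KiB.
print(lzip_dict_size(0xD3))
# A larger code can mean a smaller dictionary:
print(lzip_dict_size(0x0D), lzip_dict_size(0xED))
```

The last line shows why the mapping is not increasing: 0xED is numerically larger than 0x0D but decodes to a smaller size, because the subtraction lives in the high bits.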

And the following is a concise summary of the .zst file format:

  Everything is little endian.

  The file is a concatenated list of frames.

  Zstandard frame format is:
    u8[4] magic bytes, `28 B5 2F FD`.
    u8 frame header descriptor:
      Bits 0..1 are the dictionary ID size (1=1, 2=2, 3=4), or zero if no dictionary ID is given.
      Bit 2 is set if the content checksum is present.
      Bit 3 is reserved and should be zero.
      Bit 4 is reserved and should be zero, but should be ignored by decoders.
      Bit 5 is set if the single segment restriction applies and window size is implicit.
      Bits 6..7 are the size for the frame content size field (0=1, 1=2, 2=4, 3=8).
        Zero indicates an absence of the frame content size instead if bit 5 wasn't set.
    Optional u8 window descriptor W, where the true window size is:
      Implicitly set to the frame content size if the single segment restriction applies.
      (8 + W % 8) * 2^(7 + floor(W / 8)) bytes otherwise. (1 KiB through 3.75 TiB)
    Optional u8[*] dictionary ID, where zero means no dictionary.
    Optional u8[*] frame content size.
    For each block:
      u24 block header:
        Bit 0 is set if this block is the last in the frame.
        Bits 1..2 are the block type (0=raw, 1=RLE, 2=compressed, 3=reserved).
        Bits 3..23 are the block size.
          The block size should not exceed the block maximum size = min(window size, 128 KiB).
      If this block is RLE:
        u8 byte, which is repeated for the specified block size.
      Otherwise:
        u8[*] compressed data, whose size is the specified block size.
    Optional u32 lower 32 bits of XXH64 content checksum for the uncompressed data, w/ seed=0.
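
The frame header and block header layouts above can be sketched in Python as follows. The function names and the returned dict keys are hypothetical. One detail goes beyond the summary: a 2-byte frame content size is stored minus 256, which I am taking from the Zstandard format specification (RFC 8878).

```python
ZSTD_MAGIC = bytes.fromhex("28b52ffd")

def parse_zstd_frame_header(buf):
    """Parse a Zstandard frame header at the start of `buf` into a dict."""
    if len(buf) < 5 or buf[:4] != ZSTD_MAGIC:
        raise ValueError("not a Zstandard frame")
    fhd = buf[4]
    if fhd & 0x08:
        raise ValueError("reserved bit set")
    did_size = (0, 1, 2, 4)[fhd & 3]
    single_segment = bool(fhd & 0x20)
    # Flag value 0 means a 1-byte field only in single-segment mode.
    fcs_size = (1 if single_segment else 0, 2, 4, 8)[fhd >> 6]
    pos = 5
    window_size = None
    if not single_segment:
        w = buf[pos]
        pos += 1
        window_size = (8 + w % 8) << (7 + w // 8)
    dict_id = int.from_bytes(buf[pos:pos + did_size], "little")
    pos += did_size
    content_size = None
    if fcs_size:
        content_size = int.from_bytes(buf[pos:pos + fcs_size], "little")
        if fcs_size == 2:
            content_size += 256  # offset defined in RFC 8878
        pos += fcs_size
    if single_segment:
        window_size = content_size
    return {"window_size": window_size, "dict_id": dict_id,
            "content_size": content_size, "checksum": bool(fhd & 0x04),
            "header_len": pos}

def parse_zstd_block_header(buf, pos):
    """Return (is_last, block_type, block_size, stored_len) for a u24 header."""
    raw = int.from_bytes(buf[pos:pos + 3], "little")
    block_type = (raw >> 1) & 3
    block_size = raw >> 3
    # An RLE block stores a single byte regardless of its block size.
    stored_len = 1 if block_type == 1 else block_size
    return bool(raw & 1), block_type, block_size, stored_len

print(parse_zstd_frame_header(ZSTD_MAGIC + b"\x00\xff")["window_size"])
```

The printed value corresponds to the 3.75 TiB upper bound, since a window descriptor of `FF` gives a mantissa of 15 and an exponent of 38.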

  Skippable frame format is:
    u8[4] magic bytes, a single byte 50..5F followed by `2A 4D 18`.
    u32 the size of the following data.
    u8[*] data.
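
Since a decoder only has to step over skippable frames, that part fits in a few lines of Python. This is a sketch with a hypothetical function name, `skip_frame`, covering only the skippable frame layout given above.

```python
def skip_frame(buf, pos=0):
    """Return the offset just past a skippable frame at `pos`, or None."""
    if len(buf) - pos < 8:
        return None
    magic = int.from_bytes(buf[pos:pos + 4], "little")
    # Bytes 50..5F then 2A 4D 18 read as little endian 0x184D2A50..0x184D2A5F.
    if (magic & 0xFFFFFFF0) != 0x184D2A50:
        return None
    size = int.from_bytes(buf[pos + 4:pos + 8], "little")
    end = pos + 8 + size
    if end > len(buf):
        raise ValueError("truncated skippable frame")
    return end

frame = bytes([0x5A, 0x2A, 0x4D, 0x18]) + (3).to_bytes(4, "little") + b"abc"
print(skip_frame(frame + b"rest"))
```

A caller would loop, trying `skip_frame` first and falling back to a real Zstandard frame parser whenever it returns `None`.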