Tags: storage, archive, file-format, reproducible-research, xz

Safety of xz archive format


While looking for a good option to store large amounts of data (coming mostly from numerical computations) long-term, I arrived at the xz archive format (tar.xz). For my type of data, its default LZMA compression gives significantly better archive sizes than the more common tar.gz (both with reasonable compression options).

However, my first Google search on the safety of using xz long-term turned up the following web page (written by one of the developers of lzip), titled

Xz format inadequate for long-term archiving

which lists several reasons, including:

  • xz being a container format as opposed to simple compressed data preceded by a necessary header
  • xz format fragmentation
  • unreasonable extensibility
  • poor header design and lack of field length protection
  • 4-byte alignment and use of padding all over the place
  • inability to append trailing data to an already created archive
  • multiple issues with xz error detection
  • no options for data recovery

While some of the concerns seem a bit artificial, I wonder whether there is any solid justification for not using xz as an archive format for long-term archiving.

What should I be concerned about if I choose xz as a file format? (I assume that access to an xz program itself will not be an issue even 30 years from now.)

A couple of notes:

  • The data stored are results of numerical computations, some of which have been published in various conferences and journals. And while storing the results does not by itself guarantee research reproducibility, it is an important component of it.
  • While using the more standard tar.gz or even plain zip might be a more obvious choice, the ability to cut the archive size by about 30% is very appealing to me.

Solution

  • If integrity and redundancy are provided at a different level (for example, by the filesystem), I don't see any real argument against using xz, as it provides much better compression than zip or tar.gz.

    Many arguments can be refuted quite easily:

    E.g., who cares if the format has scope for 2^63 extensions? That is just because the author used int64_t as a data type; it doesn't mean there WILL be that many, simply that they chose a large data type.

    Variable-length integers are just fine too. They don't cause problems (if protected behind checksums) and lead to smaller files, so why wouldn't you want to use them? They can indeed cause framing errors, where a failure to decode one field also causes the next to fail, but welcome to compression! That is true of almost every stream, and this is why checksums matter.
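
    A minimal Python sketch of that idea, using the common 7-bits-per-byte scheme with a continuation bit (a rough illustration, not necessarily xz's exact on-disk encoding):

        def encode_varint(value: int) -> bytes:
            """Encode a non-negative integer, 7 bits per byte, high bit = continuation."""
            if value < 0:
                raise ValueError("varints encode non-negative integers only")
            out = bytearray()
            while True:
                byte = value & 0x7F
                value >>= 7
                if value:
                    out.append(byte | 0x80)   # more bytes follow
                else:
                    out.append(byte)          # last byte: high bit clear
                    return bytes(out)

        def decode_varint(data: bytes) -> tuple[int, int]:
            """Return (value, number_of_bytes_consumed)."""
            value = 0
            for i, byte in enumerate(data):
                value |= (byte & 0x7F) << (7 * i)
                if not byte & 0x80:
                    return value, i + 1
            raise ValueError("truncated varint")

        # Small values stay small on disk; large values still fit.
        assert encode_varint(5) == b"\x05"
        assert decode_varint(encode_varint(2**40 + 123))[0] == 2**40 + 123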

    A corrupted length field of a variable-size field (section 2.5, "Xz fails to protect the length of variable size fields") would make the decoder look for the CRC in the wrong place, so it would still show up as a CRC mismatch, and hence the integrity failure would be detected (with high probability).
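
    A small sketch of that reasoning in Python, using a hypothetical frame layout (2-byte length, payload, then a CRC32 over both) rather than xz's real structure:

        import struct
        import zlib

        def make_frame(payload: bytes) -> bytes:
            """Hypothetical frame: 2-byte length | payload | CRC32 over length + payload."""
            header = struct.pack(">H", len(payload))
            crc = zlib.crc32(header + payload)
            return header + payload + struct.pack(">I", crc)

        def check_frame(frame: bytes) -> bool:
            try:
                (length,) = struct.unpack(">H", frame[:2])
                payload = frame[2:2 + length]
                (stored_crc,) = struct.unpack(">I", frame[2 + length:2 + length + 4])
            except struct.error:
                return False  # truncated or implausible lengths are caught here
            return zlib.crc32(frame[:2] + payload) == stored_crc

        frame = make_frame(b"some numerical results!")
        assert check_frame(frame)

        # Flip the lowest bit of the length field: the decoder now reads a shorter
        # payload and looks for the CRC at the wrong offset, so the check fails
        # and the corruption is detected.
        corrupted = frame[:1] + bytes([frame[1] ^ 0x01]) + frame[2:]
        assert not check_frame(corrupted)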

    The most important argument is the inaccuracy of the decompressed data discussed in section 2.10.4, "The 'Block Check' field". However, no reason is given why, for example, "SHA-256 provides worse accuracy than CRC64 for all possible block sizes", as the formula behind that claim is not explained. Although SHA-256 is not designed primarily for error detection, it at least provides collision resistance that depends only on the length of the hash:

    The thing to remember is that, unlike a CRC where certain types of input are more or less likely to result in a collision (with certain types of input having a 0% chance of causing a collision), the actual probability of collisions for input to a cryptographic hash is a function of only the length of the hash.

    However, the probability of a collision for SHA-256 is below 2^-128, so even budgeting for all 2^33 bits of a 1 GB file (1,073,741,824 bytes = 2^30 bytes = 2^30 * 2^3 bits = 2^33 bits) still leaves a security margin of 95 bits (collision probability 2^-95 = 10^-(95 * ln(2) / ln(10)) = 10^-28.6), which is quite good, and much smaller than the 3*10^-8 shown in the graph.
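
    The arithmetic above, reproduced in a few lines of Python so you can plug in your own file size or hash length:

        from math import log10, log2

        hash_bits = 256
        collision_security_bits = hash_bits // 2      # birthday bound: ~2^-128
        file_bits = 2**30 * 8                         # 1 GiB = 2^30 bytes = 2^33 bits
        margin_bits = collision_security_bits - int(log2(file_bits))   # 128 - 33 = 95

        # Express 2^-95 as a power of ten: 2^-95 = 10^(-95 * log10(2)) ≈ 10^-28.6
        print(f"margin: {margin_bits} bits, i.e. about 10^-{margin_bits * log10(2):.1f}")
        # -> margin: 95 bits, i.e. about 10^-28.6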

    [Update: The author of the paper explains this here, pointing out that for archiving purposes the optimal solution is a tradeoff between integrity and availability, which is reasonable and conforms to current scientific literature [Koopman, p. 33].]

    If the data is compressed using xz with the default settings (compression preset level 6, etc.), it should be possible to decompress it in the future without problems.
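
    As a quick sanity check, Python's standard lzma module can produce the same xz container with the same preset and block check, so a round trip is easy to test (the data below is a placeholder):

        import lzma

        data = b"example numerical results\n" * 10_000   # placeholder payload

        # xz container, preset 6 and CRC64 block check (the xz command-line defaults).
        compressed = lzma.compress(data, format=lzma.FORMAT_XZ,
                                   check=lzma.CHECK_CRC64, preset=6)

        # Round trip; a corrupted container would raise lzma.LZMAError here instead.
        assert lzma.decompress(compressed) == data
        print(f"{len(data)} bytes -> {len(compressed)} bytes")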

    Having said this, using lzip -9 might indeed be the better solution. While the problems mentioned in the paper may occur only rarely, they can occur. The lzip concept looks more convincing at first glance, and at the highest compression level, lzip -9 yields better results than xz -9 and zstd -19 (I used barcode-0.99.tar and calgary.tar; see also "Lzip compresses tarballs more than xz").
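
    If you want to repeat that comparison on your own data, a small sketch along these lines prints the resulting sizes (it assumes the xz, lzip and zstd command-line tools are installed; the tarball name is a placeholder):

        import subprocess
        from pathlib import Path

        tarball = Path("calgary.tar")   # placeholder: point this at your own tarball

        # Each tool writes the compressed stream to stdout (-c), leaving the input intact.
        commands = {
            "xz -9":    ["xz", "-9", "-c", str(tarball)],
            "lzip -9":  ["lzip", "-9", "-c", str(tarball)],
            "zstd -19": ["zstd", "-19", "-c", str(tarball)],
        }

        for name, cmd in commands.items():
            out = subprocess.run(cmd, capture_output=True, check=True).stdout
            print(f"{name:>8}: {len(out):>10} bytes")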

    The Reed–Solomon error correction employed by par2 to provide redundancy and integrity is mainly used for transmission (radio, TV) and offline data (CDs). For online storage (hard disk drives) I would prefer the ZFS filesystem with mirror/parity disks for redundancy and regular scrubbing for integrity, plus an offline copy (backup).
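
    That said, if you do want par2-style recovery data next to an archive, the workflow is short; a sketch assuming the par2 command-line tool (par2cmdline) is installed, with a placeholder archive name:

        import subprocess

        archive = "results.tar.xz"   # placeholder archive name

        # Create Reed-Solomon recovery data with ~10% redundancy next to the archive.
        subprocess.run(["par2", "create", "-r10", archive], check=True)

        # Later: verify the archive against the recovery data, and repair if needed.
        subprocess.run(["par2", "verify", archive + ".par2"], check=True)
        # subprocess.run(["par2", "repair", archive + ".par2"], check=True)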