large-files libarchive

Untar files larger than 2GB (w/o using libarchive if possible)


I'm successfully using https://github.com/libarchive/libarchive/blob/master/contrib/untar.c as dependency-free code to read TAR files, but that code fails on a .tar file that contains a single 10GB file entry. It actually fails at the checksum-verification stage, on the very first 512-byte block of the first (and only) file entry. And the (octal-encoded, 12-byte) length expected at offset 124 seems to be garbage.

I can find very little info on TAR format for large files. Normally 12 octals can encode a 2^36 (64GB) file length, if I'm not mistaken, plenty enough for a 10GB entry, but obviously something more is at play here.

My (corporate) build environment does not allow use of libarchive at the moment, and I would like to continue using ad-hoc code for now (see below why). Any info on how the encoding changes for files larger than 2GB in those 512-byte header blocks? Any flags to check for extended headers or TAR variant? Any pointers to some doc on TAR specifically for the >2GB case? I didn't find any.

My use case is a little special: I want to decode the custom-binary-formatted files inside the (non-compressed) TAR on-the-fly, in a streaming fashion, recording offsets into those files (and thus into the uncompressed archive) for later use. Ideally I'd memory-map the whole archive and streaming-decode it to discover the (inner) files within, then streaming-decode those, generating records (for further processing downstream) which do not copy but reference large chunks of the archive. I suspect this will be difficult to pull off using the libarchive API I'm seeing in the example, but it is easily doable if I have more control over the TAR decoding (like I do now for small file entries).

And looking at the libarchive code itself, in hope of finding more info, turned out to be rather difficult... I can't seem to make heads or tails of it. Any help would be appreciated.


Solution

  • Ironically, while having gone through the trouble of actually using libarchive to solve my issue, I discovered the man page for tar actually explains the format variants at a high level rather well... So it's possible I could have sorted this out myself after all. I might later.
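
For reference, here is a minimal sketch of the size-field decoding as I understand it from tar(5); it is not taken from untar.c. The 12-byte field normally holds 11 octal digits plus a terminator, which caps an entry at 8GB minus one byte (not 64GB), so a 10GB entry cannot be stored that way. GNU tar instead sets the high bit of the first byte and stores the value as a big-endian base-256 number, which a plain octal parser reads as garbage. The name `tar_parse_number` is hypothetical, and the sketch assumes the field length is at most 12 bytes and rejects negative values. It does not handle the pax variant, which stores the size in a separate extended header (typeflag 'x') instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode a numeric tar header field (hypothetical helper, not part of untar.c).
 * Handles classic octal and the GNU base-256 extension.
 * Returns 0 on success, -1 on malformed, negative, or out-of-range input.
 */
static int tar_parse_number(const unsigned char *p, size_t n, uint64_t *out)
{
    uint64_t v = 0;
    size_t i = 0;

    assert(p != NULL && out != NULL);
    if (n == 0)
        return -1;

    if (p[0] & 0x80) {
        /* GNU base-256: high bit set, remaining bits are big-endian binary. */
        if (p[0] & 0x40)
            return -1;              /* sign bit set: negative, invalid for a size */
        v = p[0] & 0x3f;
        for (i = 1; i < n; i++) {
            if (v >> 56)
                return -1;          /* next shift would overflow 64 bits */
            v = (v << 8) | p[i];
        }
        *out = v;
        return 0;
    }

    /* Classic octal: optional leading spaces, digits, then NUL/space padding. */
    while (i < n && p[i] == ' ')
        i++;
    for (; i < n && p[i] >= '0' && p[i] <= '7'; i++)
        v = (v << 3) | (uint64_t)(p[i] - '0');
    for (; i < n; i++) {
        if (p[i] != ' ' && p[i] != '\0')
            return -1;              /* unexpected byte after the digits */
    }
    *out = v;
    return 0;
}
```

With a 512-byte header block in `block`, the call would be `tar_parse_number(block + 124, 12, &size)`. The result needs to be kept in a 64-bit variable all the way through, including the loop that skips or reads the entry's data blocks.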