Tags: python, python-3.x, gzip, tarfile

Uncompressing .gz files and storing them in a .tar.gz archive


I have the following problem: I am writing a function that looks for a bunch of .gz files, uncompresses them, and stores the individually uncompressed files in a bigger .tar.gz archive. So far, I managed to implement it with the following code, but manually computing the uncompressed file size and setting the TarInfo size seem rather hackish and I would like to know whether there is a more idiomatic solution to my problem:

import gzip
import os
import pathlib
import tarfile

def gather_compressed_files(input_dir: pathlib.Path, output_file: str):
    with tarfile.open(output_file, 'w:gz') as tar:
        for input_file in input_dir.glob('*.gz'):
            with gzip.open(input_file) as fd:
                tar_info = tarfile.TarInfo(input_file.stem)
                # Seeking to EOF decompresses the whole stream once just to
                # learn the uncompressed size; we then rewind for the copy.
                tar_info.size = fd.seek(0, os.SEEK_END)
                fd.seek(0, os.SEEK_SET)
                tar.addfile(tar_info, fd)
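
For reference, a typical invocation looks like this (the paths here are placeholders for illustration only):

gather_compressed_files(pathlib.Path('incoming'), 'archive.tar.gz')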

I tried to obtain the TarInfo object the following way instead of constructing it manually:

tar_info = tar.gettarinfo(arcname=input_file.stem, fileobj=fd)

However, this function retrieves the path of the original .gz file we opened as fd in order to compute its size, so the resulting tar_info.size corresponds to the compressed .gz data rather than the uncompressed data, which is not what I want. Not setting tar_info.size at all doesn't work either, because addfile uses that size when it is passed a file object.
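
A minimal check, run inside the loop above, makes the problem visible (the comparison is illustrative only):

# gettarinfo() stats the file behind fd (via fd.name), so the size it
# records is the on-disk, compressed size of the .gz file.
with gzip.open(input_file) as fd:
    tar_info = tar.gettarinfo(arcname=input_file.stem, fileobj=fd)
    print(tar_info.size == os.path.getsize(input_file))  # prints True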

Is there a better, more idiomatic way to achieve this or am I stuck with my current solution?


Solution

  • Your approach is the only way to avoid decompressing the file completely to disk or RAM. After all, you need to know the size ahead of time to add it to the tar file, and gzip files don't really know their own decompressed size. The ISIZE trailer field theoretically provides the decompressed size, but the field was defined back in the 32-bit days, so it actually stores the size modulo 2**32; a file that was originally 4 GiB and one that was 0 B would have the same ISIZE. Regardless, Python doesn't expose ISIZE, so even if it were useful, there would be no built-in way to read it (you can always muck about with manual parsing, sketched at the end of this answer, but that's not exactly clean or idiomatic).

    If you want to avoid decompressing the file twice (once to seek to the end, once to actually add it to the tar file), at the expense of decompressing it to disk, you can use a tempfile.TemporaryFile with a slight tweak: the data is decompressed only once, and never has to fit in memory:

    import gzip
    import pathlib
    import shutil
    import tarfile
    import tempfile
    
    def gather_compressed_files(input_dir: pathlib.Path, output_file: str):
        with tarfile.open(output_file, 'w:gz') as tar:
            for input_file in input_dir.glob('*.gz'):
                with tempfile.TemporaryFile() as tf:
                    # Could combine both in one with, but this way we close the gzip
                    # file ASAP
                    with gzip.open(input_file) as fd:
                        shutil.copyfileobj(fd, tf)
                    tar_info = tarfile.TarInfo(input_file.stem)
                    tar_info.size = tf.tell()
                    tf.seek(0)
                    tar.addfile(tar_info, tf)
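
    For completeness, this is roughly what the manual ISIZE parsing mentioned above could look like. It is a sketch only (gzip_isize is a made-up helper, not part of any library): ISIZE stores just the size modulo 2**32, and for a multi-member gzip file the trailer describes only the last member, so it is not a reliable replacement for seeking through the stream.

    import os
    import struct

    def gzip_isize(path):
        # ISIZE is the last four bytes of a gzip member: the decompressed
        # size modulo 2**32, stored little-endian (RFC 1952).
        with open(path, 'rb') as f:
            f.seek(-4, os.SEEK_END)
            return struct.unpack('<I', f.read(4))[0]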