I have the following problem: I am writing a function that looks for a bunch of .gz files, uncompresses them, and stores the individually uncompressed files in a bigger .tar.gz archive. So far, I have managed to implement it with the following code, but manually computing the uncompressed file size and setting the TarInfo size seems rather hackish, and I would like to know whether there is a more idiomatic solution to my problem:
import gzip
import os
import pathlib
import tarfile

def gather_compressed_files(input_dir: pathlib.Path, output_file: str):
    with tarfile.open(output_file, 'w:gz') as tar:
        for input_file in input_dir.glob('*.gz'):
            with gzip.open(input_file) as fd:
                tar_info = tarfile.TarInfo(input_file.stem)
                # Seeking to the end of the GzipFile decompresses the whole
                # stream and returns the uncompressed size.
                tar_info.size = fd.seek(0, os.SEEK_END)
                fd.seek(0, os.SEEK_SET)
                tar.addfile(tar_info, fd)
I tried to create the TarInfo object the following way instead of filling it in manually:

tar_info = tar.gettarinfo(arcname=input_file.stem, fileobj=fd)
However, this function retrieves the path of the original .gz file we opened as fd to compute its size, and thus the tar_info.size it provides corresponds to the compressed .gz data rather than the uncompressed data, which is not what I want. Not setting tar_info.size at all doesn't work either, because addfile uses that size when passed a file object.
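A quick check makes the problem visible; this sketch assumes a hypothetical compressed file data.gz and shows that gettarinfo simply stats the file object underlying fd:

import gzip
import os
import tarfile

with tarfile.open('demo.tar.gz', 'w:gz') as tar, gzip.open('data.gz') as fd:
    info = tar.gettarinfo(arcname='data', fileobj=fd)
    # Both lines print the on-disk (compressed) size of data.gz,
    # not the size of the decompressed payload.
    print(info.size)
    print(os.fstat(fd.fileno()).st_size)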
Is there a better, more idiomatic way to achieve this or am I stuck with my current solution?
Your approach is the only way to avoid decompressing the file completely to disk or RAM. After all, you need to know the size ahead of time to add it to the tar file, and gzip files don't really know their own decompressed size. The ISIZE header field theoretically provides the decompressed size, but the field was defined back in the 32-bit days, so it's actually the size modulo 2**32; a file that was originally exactly 4 GiB and a 0 B file would have the same ISIZE. Regardless, Python doesn't expose ISIZE, so even if it were useful, there would be no built-in way to do this (you can always muck about with manual parsing, but that's not exactly clean or idiomatic).
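For the curious, that manual parsing would look something like this; it's a minimal sketch assuming a single-member gzip file, and gzip_isize is just a hypothetical helper name (per RFC 1952, ISIZE is the last four bytes of a member, stored little-endian):

import os
import struct

def gzip_isize(path: str) -> int:
    # ISIZE: uncompressed size modulo 2**32, stored as a little-endian
    # uint32 in the final four bytes of a gzip member (RFC 1952).
    with open(path, 'rb') as f:
        f.seek(-4, os.SEEK_END)
        return struct.unpack('<I', f.read(4))[0]

Because of the modulo and the single-member assumption, this is only trustworthy for simple files under 4 GiB, which is exactly why it isn't a robust source for tar_info.size.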
If you want to avoid decompressing the file twice (once to seek forward, once to actually add it to the tar file), you can, at the expense of decompressing it to disk, use a tempfile.TemporaryFile to hold the uncompressed data without storing it all in memory, with a slight tweak:
import gzip
import pathlib
import shutil
import tarfile
import tempfile

def gather_compressed_files(input_dir: pathlib.Path, output_file: str):
    with tarfile.open(output_file, 'w:gz') as tar:
        for input_file in input_dir.glob('*.gz'):
            with tempfile.TemporaryFile() as tf:
                # Could combine both in one with, but this way we close the
                # gzip file ASAP
                with gzip.open(input_file) as fd:
                    shutil.copyfileobj(fd, tf)
                tar_info = tarfile.TarInfo(input_file.stem)
                tar_info.size = tf.tell()
                tf.seek(0)
                tar.addfile(tar_info, tf)
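Either version is called the same way; for example, with hypothetical paths:

gather_compressed_files(pathlib.Path('incoming'), 'combined.tar.gz')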