python gzip large-files os.walk tarfile

How can I access a file that is in a subfolder of a gzip-compressed folder without extracting everything?


Via an SSH server I have access to a data set. The data set is split into several files named File1.xml.gz, File2.xml.gz, and so on. The naming of these files is a bit misleading in two ways:

  1. Since each one contains a folder, I assume that it is, strictly speaking, a .tar.gz file, but this is not obvious from the name (it only says .gz).

  2. When you unzip them, you don't get File1.xml etc. directly. Instead, each archive contains a single first-level subfolder (and nothing else), which contains a second subfolder (and nothing else), which contains a third subfolder (and nothing else), which finally contains a fourth subfolder holding File1.xml (and nothing else).

    I have sketched this in a picture of the folder structure:

    [Image: visualization of the folder structure]

    It is exactly this file in the lowest level that I want to access.

My problem: I am not allowed to delete the (apparently superfluous) folders, there is hardly any space left on the server, and the files are extremely large, so I can't just unpack them. Instead, I want to read the contents of the files line by line.

I think I know how to find a file that is embedded in several subfolders:

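import os

for root, dirs, files in os.walk(directory, topdown=False):
    for file in files:
        # compare case-insensitively, since the files are named File1.xml etc.
        if file.lower().startswith('file') and file.endswith('.xml'):
            # do something with the file, e.g. build its full path
            path = os.path.join(root, file)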

And I know how to read a zipped file without explicitly unzipping it:

import gzip

with gzip.open('path to file1.xml.gz', 'rt', encoding='utf-8') as file1:
    for line in file1:
        print(line)

But accessing a file that's in the sub-sub-sub-folder of a zipped folder? Is that possible?


Solution

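  • Use tarfile, opening the archive with mode "r|gz". This reads the gzip-compressed tar as a forward-only stream, so nothing is unpacked to disk. Call next() until you reach the member you want, then call extractfile() on that member to get a buffered binary stream you can read from. The stream returns bytes, so decode them if you need text.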

    >>> import tarfile
    >>> t = tarfile.open("file.gz","r|gz")
    >>> t.next()
    <TarInfo 'a' at 0x1044d3b38>
    >>> t.next()
    <TarInfo 'a/b' at 0x1044d39a8>
    >>> t.next()
    <TarInfo 'a/b/c' at 0x1044d38e0>
    >>> t.next()
    <TarInfo 'a/b/c/d' at 0x1044d3a70>
    >>> m = t.next()
    >>> m.name
    'a/b/c/d/file'
    >>> f = t.extractfile(m)
    >>> f.readline()
    b'this\n'
    >>> f.readline()
    b'is\n'
    >>> f.readline()
    b'a\n'
    >>> f.readline()
    b'test\n'
    >>> f.readline()
    b''
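
The interactive session above can be wrapped into a small generator that reads the nested file line by line. This is only a sketch under a few assumptions: the function name iter_member_lines and its suffix and encoding parameters are made up for illustration, the archive really is a gzip-compressed tar, and the text uses an ASCII-compatible encoding such as UTF-8 (so splitting on newline bytes before decoding is safe).

```python
import tarfile


def iter_member_lines(archive_path, suffix=".xml", encoding="utf-8"):
    """Yield decoded lines of the first regular file whose name ends with suffix.

    Hypothetical helper, not part of the tarfile module.
    """
    # "r|gz" reads the archive as a forward-only stream; nothing is
    # extracted to disk.
    with tarfile.open(archive_path, "r|gz") as archive:
        for member in archive:
            # Skip the nested directory entries and any non-matching files.
            if not (member.isfile() and member.name.endswith(suffix)):
                continue
            # extractfile() returns a binary file-like object; in stream
            # mode it must be read before moving on to the next member.
            raw = archive.extractfile(member)
            for raw_line in raw:
                yield raw_line.decode(encoding)
            return  # stop after the first matching file
```

It could then be used like this, with 'File1.xml.gz' standing in for the real path:

```python
for line in iter_member_lines("File1.xml.gz"):
    print(line, end="")
```

Because the members are visited in archive order and only one line is held in memory at a time, this should work even when the XML file is far larger than the free disk space.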