How can I access a file that is in a subfolder of a gzip-compressed folder without extracting everything?

Via an SSH server I have access to a data set. The data set is split into several files named File1.xml.gz, File2.xml.gz, and so on. The naming of these files is a bit misleading in two ways:

Since each file contains a folder, I assume that it is strictly speaking a .tar.gz file, but this is not obvious from the name (it only says .gz).
When you unzip one, you don't get File1.xml directly. Each archive contains a single first-level folder (and nothing else), which contains a second subfolder (and nothing else), which contains a third subfolder (and nothing else), which finally contains a fourth subfolder holding File1.xml (and nothing else).

I have sketched this in a picture of the folder structure:

It is exactly this file in the lowest level that I want to access.

My problem: I am not allowed to delete the (apparently superfluous) folders, there is hardly any space left on the server, and the files are extremely large, so I can't just unpack them. Instead, I want to read the contents of the files line by line.

I think I know how to find a file that is embedded in several subfolders:

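import os

for root, dirs, files in os.walk(directory, topdown=False):
    for file in files:
        # The files are named File1.xml etc., so match the prefix case-insensitively.
        if file.lower().startswith('file') and file.endswith('.xml'):
            path = os.path.join(root, file)
            # do something with path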

And I know how to read a zipped file without explicitly unzipping it:

import gzip

with gzip.open('path to file1.xml.gz', 'rt', encoding='utf-8') as file1:
    for line in file1:
        print(line, end='')  # each line already ends with a newline

But how do I access a file that sits in a sub-sub-sub-folder of a compressed folder? Is that possible?

Solution

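Assuming the files really are gzipped tar archives, use tarfile and open the archive in stream mode with "r|gz". Call next() until you reach the member you want, then pass that member to extractfile(), which returns a buffered binary stream you can read from line by line. Nothing is unpacked to disk. In stream mode the archive is read front to back, so read each member before moving on to the next one.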

>>> import tarfile
>>> t = tarfile.open("file.gz","r|gz")
>>> t.next()
<TarInfo 'a' at 0x1044d3b38>
>>> t.next()
<TarInfo 'a/b' at 0x1044d39a8>
>>> t.next()
<TarInfo 'a/b/c' at 0x1044d38e0>
>>> t.next()
<TarInfo 'a/b/c/d' at 0x1044d3a70>
>>> m = t.next()
>>> m.name
'a/b/c/d/file'
>>> f = t.extractfile(m)
>>> f.readline()
b'this\n'
>>> f.readline()
b'is\n'
>>> f.readline()
b'a\n'
>>> f.readline()
b'test\n'
>>> f.readline()
b''
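
The session above can be wrapped into a small generator so that you don't have to call next() by hand. This is only a sketch: the function name iter_member_lines and its suffix and encoding parameters are made up for illustration, and it assumes the archive is a gzipped tar whose XML member is UTF-8 encoded (or another encoding in which a newline is always the single byte b'\n').

```python
import tarfile


def iter_member_lines(archive_path, suffix=".xml", encoding="utf-8"):
    """Yield decoded lines of the first regular file whose name ends with suffix.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # "r|gz" reads the archive as a forward-only gzip stream, so nothing is
    # unpacked to disk and members have to be consumed in archive order.
    with tarfile.open(archive_path, "r|gz") as archive:
        for member in archive:
            # Skip the nested directory entries (a, a/b, ...) and any other files.
            if member.isfile() and member.name.endswith(suffix):
                raw = archive.extractfile(member)  # binary, file-like object
                for raw_line in raw:
                    # Decoding line by line is safe for UTF-8, because b"\n"
                    # never occurs inside a multi-byte character.
                    yield raw_line.decode(encoding)
                return  # stop after the first match


# Usage (path is a placeholder):
# for line in iter_member_lines("File1.xml.gz"):
#     print(line, end="")
```

Because the member is matched by its name suffix, the depth of the folder nesting does not matter. If an archive contains no matching member, the generator simply yields nothing.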