Via an SSH server I have access to a data set. This data set is divided into several files, named File1.xml.gz, File2.xml.gz, etc. The naming of these files is a bit misleading in two ways:
Since it is a folder, I assume that it is strictly speaking a .tar.gz file, but this is not obvious from the name (it only says .gz).
When you unzip them, you don't get File1.xml etc. directly. Instead, each archive contains a single (sub)folder (and nothing else), which in turn contains a second subfolder (and nothing else), which contains a third subfolder (and nothing else), which finally contains the fourth subfolder, in which File1.xml (and nothing else) is located.
I have sketched this in a picture of the folder structure:
It is exactly this file in the lowest level that I want to access.
My problem: I am not allowed to delete the (apparently superfluous) folders, there is hardly any space left on the server, and the files are extremely large, so I can't just unpack them. Instead, I want to read the contents of the files line by line.
I think I know how to find a file that is embedded in several subfolders:
import os

for root, dirs, files in os.walk(directory, topdown=False):
    for file in files:
        # Compare case-insensitively: the files are named File1.xml, File2.xml, ...
        if file.lower().startswith('file') and file.endswith('.xml'):
            # do something with file
            pass
And I know how to read a zipped file without explicitly unzipping it:
import gzip

with gzip.open('path to file1.xml.gz', 'rt', encoding='utf-8') as file1:
    for line in file1:
        print(line)
But accessing a file that's in the sub-sub-sub-folder of a zipped folder? Is that possible?
Use tarfile, opening with mode "r|gz". Use next() until you get to what you want, then extractfile() on that member to return a buffered stream you can read from.
>>> import tarfile
>>> t = tarfile.open("file.gz","r|gz")
>>> t.next()
<TarInfo 'a' at 0x1044d3b38>
>>> t.next()
<TarInfo 'a/b' at 0x1044d39a8>
>>> t.next()
<TarInfo 'a/b/c' at 0x1044d38e0>
>>> t.next()
<TarInfo 'a/b/c/d' at 0x1044d3a70>
>>> m = t.next()
>>> m.name
'a/b/c/d/file'
>>> f = t.extractfile(m)
>>> f.readline()
b'this\n'
>>> f.readline()
b'is\n'
>>> f.readline()
b'a\n'
>>> f.readline()
b'test\n'
>>> f.readline()
b''