
Extract lzma compressed tar archive members without writing to disk


I have a nested tar file, two tars deep. The outermost tar is gpg encrypted and not compressed; the inner tar is lzma compressed. Working with the innermost tar from disk, I don't have any problems: passing the innermost tar.xz file directly to with tarfile.open() as get_lzma works, and the code following that line executes without error. I can extract the tar members and json.load() the data.

It's a small file and the data is sensitive. The file has to sit on disk while I work with it, so I don't want to decrypt it and extract the innermost tar to disk; I'd like to access the members in memory instead. I can decrypt the gpg file, and for member in get_lzma.getmembers(): returns the TarInfo objects I'd expect, so the members appear to be there. I just can't do anything with them: when I run extractfile(), I can't .read() the result, which is shown as <ExFileObject name=None>.

At this point I'm just curious as to why this isn't working.

In case the file structure is unclear this is how it's sitting on disk:

file.tar.gpg <- is a tar file
 file.tar.xz <- is a compressed tar file
   member1
   memberN

The traceback:

    json.load(file_o)
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/tarfile.py", line 681, in read
    self.fileobj.seek(offset + (self.position - start))
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/lzma.py", line 253, in seek
    return self._buffer.seek(offset, whence)
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/_compression.py", line 143, in seek
    data = self.read(min(io.DEFAULT_BUFFER_SIZE, offset))
  File "/home/user/.pyenv/versions/3.8.2/lib/python3.8/_compression.py", line 103, in read
    data = self._decompressor.decompress(rawblock, size)
_lzma.LZMAError: Input format not supported by decoder

The code:

    with open(gpg_encrypted_tar_archive, 'rb') as f:
        try:
            decrypted_data = gpg.decrypt_file(f, passphrase=passph)
            assert decrypted_data.ok
        except AssertionError:
            print(f"Decryption failed with message '{decrypted_data.status}' and status '{decrypted_data.ok}'")

        io_bytes_file_like_object = io.BytesIO(decrypted_data.data)

        # untar the parent archive
        tarfile.open(fileobj=io_bytes_file_like_object, mode='r')

        with tarfile.open(fileobj=io_bytes_file_like_object, mode='r:xz', debug=3, errorlevel=2) as get_lzma:
            for member in get_lzma.getmembers():
                if member.isfile():
                    file_o = get_lzma.extractfile(member)
                    json.load(file_o)

Solution

  • A day away from this issue seems to have cleared up my understanding of the problem. I'll explain my thinking and what I found.

    TL;DR: I got my TarInfo, TarFile and ExFileObject objects all mixed up.

    As per the OP, a file passed to open() is represented as <_io.BufferedReader name='myfile.tar.xz'> under the covers. So getting the code to work with the unencrypted file and open() really only proves that the file wasn't corrupted. No issues there, so I won't mention it again.

    Following the code example in the OP, decrypted_data.data returns a raw byte string from the decrypted_data object. These raw bytes are my two tar files: the uncompressed outer tar with the compressed tar nested inside it. The byte string starts b'myfile.tar.xz\x00\x00\x00\...... etc. We wrap it in an io.BytesIO object, io.BytesIO(decrypted_data.data), so that we have a file interface to work with and so it can be passed to tarfile.open() initially. So far so good, but that's where things started to go wrong.
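
    The leading b'myfile.tar.xz\x00... is simply the first tar header: a member header begins with the member's name, NUL-padded. A minimal sketch that shows this with a throwaway in-memory archive (the payload and the helper name build_tar are made up for illustration, they are not from the OP):

```python
import io
import tarfile


def build_tar(member_name: str, payload: bytes) -> bytes:
    """Return the raw bytes of an uncompressed tar holding one member."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name=member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


# Hypothetical stand-in for decrypted_data.data.
raw = build_tar("myfile.tar.xz", b"stand-in for the compressed inner tar")

# The header's name field comes first, so the byte string starts with the name.
print(raw[:20])

# Wrapping the bytes in BytesIO gives tarfile.open() the file interface it needs.
with tarfile.open(fileobj=io.BytesIO(raw), mode="r") as outer:
    names = outer.getnames()
print(names)
```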

    I've decided to go with two context managers in the following code. In the OP I called tarfile.open() twice on the same object; I guess I must have assumed that mode='r' came with an extraction operation. It doesn't: it only opens the archive. You can see that I now make two calls to extractfile(), one on the members of each context manager. The first extractfile() extracts the tar.xz from the uncompressed tar. This is the ExFileObject "file" I eventually pass to the next context manager with mode='r:xz'. In the OP, the second open() was given the BytesIO wrapping the outer tar rather than the extracted data, which is wrong.
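
    To keep the three types apart: tarfile.open() only opens the archive and returns a TarFile, getmembers() yields TarInfo metadata records, and only extractfile() hands back a readable file object. A small sketch, assuming a made-up one-member archive built in memory:

```python
import io
import tarfile

payload = b'{"hello": "world"}'

# Build a tiny archive in memory (hypothetical member name, illustration only).
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w") as tar:
    info = tarfile.TarInfo(name="member1")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
archive.seek(0)

with tarfile.open(fileobj=archive, mode="r") as tar:  # TarFile: nothing extracted yet
    member = tar.getmembers()[0]                      # TarInfo: metadata only
    extracted = tar.extractfile(member)               # readable file object
    kinds = (type(tar).__name__, type(member).__name__)
    data = extracted.read()                           # the member's actual bytes

print(kinds, data)
```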

    The second call to extractfile() is made on a member of the second context manager in order to get readable_members_value from myfile.tar.xz.

    with tarfile.open(fileobj=io_bytes_file_like_object, mode='r') as uncompressed_tar_file:

        # uncompressed_tar_file is a tarfile.TarFile object
        for outer_member in uncompressed_tar_file.getmembers():

            # outer_member is a TarInfo object
            tar_file_object = uncompressed_tar_file.extractfile(outer_member)

            # tar_file_object is an ExFileObject holding the tar.xz data
            with tarfile.open(fileobj=tar_file_object, mode='r:xz', debug=3, errorlevel=2) as lzma_compressed_tar_file:
                for member in lzma_compressed_tar_file.getmembers():
                    if member.isfile():
                        readable_members_value = lzma_compressed_tar_file.extractfile(member)

                        # read() now works where file_o.read() failed in the OP;
                        # call it only once, because a second read() returns b''
                        raw_bytes = readable_members_value.read()
                        print(raw_bytes)

                        json_data = json.loads(raw_bytes.decode("utf-8"))
                        print(json_data)
    

    The rest of the code is pretty much identical apart from the last few lines. You can see from the variable names in the OP that I was expecting file_o to be a file object. json.load(file_o) works against a file pointer, but here I call readable_members_value.read() myself, which returns bytes (b'...'), so it's json.loads() that I need, not json.load().
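
    An end-to-end sketch of the same pattern that runs without gpg. The archives are built in memory as a stand-in for decrypted_data.data, and make_tar and read_nested_json are hypothetical helper names, not part of the original code. It passes the extracted file object straight to json.load(), which accepts binary file objects in Python 3.6 and later; decoding the bytes and calling json.loads() gives the same result.

```python
import io
import json
import tarfile


def make_tar(members: dict, mode: str) -> bytes:
    """Build a tar archive in memory from {name: bytes}; mode is 'w' or 'w:xz'."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_nested_json(outer_bytes: bytes) -> dict:
    """Parse every JSON file inside every tar.xz member of an uncompressed tar."""
    results = {}
    with tarfile.open(fileobj=io.BytesIO(outer_bytes), mode="r") as outer:
        for outer_member in outer.getmembers():
            if not outer_member.isfile():
                continue
            # File object over the tar.xz data; nothing is written to disk.
            inner_fileobj = outer.extractfile(outer_member)
            with tarfile.open(fileobj=inner_fileobj, mode="r:xz") as inner:
                for member in inner.getmembers():
                    if member.isfile():
                        # json.load() reads from the binary file object itself.
                        results[member.name] = json.load(inner.extractfile(member))
    return results


# Stand-in for decrypted_data.data: a plain tar that contains one tar.xz.
inner_bytes = make_tar({"member1": b'{"a": 1}', "memberN": b'{"b": [2, 3]}'}, "w:xz")
outer_bytes = make_tar({"file.tar.xz": inner_bytes}, "w")

parsed = read_nested_json(outer_bytes)
print(parsed)
```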