
Tarfile/Zipfile extractall() changing filename of some files


Hello, I am currently working on a tool that has to extract some .tar files.

It works great for the most part but I have one problem:

Some .tar and .zip files contain members whose names include "illegal" characters (for example ":"). This program has to run on Windows machines, so I have to deal with this.

Is there a way to change the names of some of the files in the extracted output if they contain a ":" or another character that is illegal on Windows?

My current implementation:

import tarfile
import zipfile


def read_zip(filepath, extractpath):
    with zipfile.ZipFile(filepath, 'r') as zfile:
        contains_bad_char = False
        for finfo in zfile.infolist():
            if ":" in finfo.filename:
                contains_bad_char = True
        if not contains_bad_char:
            zfile.extractall(path=extractpath)


def read_tar(filepath, extractpath):
    with tarfile.open(filepath, "r:gz") as tar:
        contains_bad_char = False
        for member in tar.getmembers():
            if ":" in member.name:
                contains_bad_char = True
        if not contains_bad_char:
            tar.extractall(path=extractpath)

So currently I am just skipping these archives altogether, which is not ideal.

To describe better what I am asking for I can provide a small example:

file_with_files.tar -> small_file_1.txt
                    -> small_file_2.txt
                    -> annoying:file_1.txt
                    -> annoying:file_2.txt

Should extract to

file_with_files -> small_file_1.txt
                -> small_file_2.txt
                -> annoying_file_1.txt
                -> annoying_file_2.txt

Is the only solution to iterate over every file object in the compressed file and extract them one by one, or is there a more elegant solution?


Solution

  • According to [Python.Docs]: ZipFile.extract(member, path=None, pwd=None):

    On Windows illegal characters (:, <, >, |, ", ?, and *) replaced by underscore (_).

    So, for .zip files on Windows, things are already taken care of:

    >>> import os
    >>> import zipfile
    >>>
    >>> os.getcwd()
    'e:\\Work\\Dev\\StackOverflow\\q055340013'
    >>> os.listdir()
    ['arch.zip']
    >>>
    >>> zf = zipfile.ZipFile("arch.zip")
    >>> zf.namelist()
    ['file0.txt', 'file:1.txt']
    >>> zf.extractall()
    >>> zf.close()
    >>>
    >>> os.listdir()
    ['arch.zip', 'file0.txt', 'file_1.txt']
    

    A quick browse through TarFile (source and docs) didn't reveal anything similar (and I wouldn't be very surprised if there wasn't, as the .tar format is mainly used on *Nix), so you'd have to do it manually. Things aren't as simple as I expected, since TarFile doesn't offer a way to extract a member under a different name, like ZipFile does.
    Anyway, here's a piece of code (with ZipFile and TarFile as sources of inspiration):

    code00.py:

    #!/usr/bin/env python
    
    import sys
    import os
    import tarfile
    
    
    def unpack_tar(filepath, extractpath=".", compression_flag="*"):
        # Characters that Windows does not allow in file names.
        win_illegal = ':<>|"?*'
        table = str.maketrans(win_illegal, '_' * len(win_illegal))
        with tarfile.open(filepath, "r:" + compression_flag) as tar:
            for member in tar.getmembers():
                # Build the sanitized output path under extractpath.
                target = os.path.join(extractpath, member.path.translate(table))
                if member.isdir():
                    os.makedirs(target, exist_ok=True)
                elif member.isfile():
                    # The archive might not list parent directories explicitly.
                    os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
                    with tar.extractfile(member) as fin, open(target, "wb") as fout:
                        fout.write(fin.read())
    
    
    def main(*argv):
        unpack_tar("arch00.tar")
    
    
    if __name__ == "__main__":
        print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                       64 if sys.maxsize > 0x100000000 else 32, sys.platform))
        rc = main(*sys.argv[1:])
        print("\nDone.")
        sys.exit(rc)
    

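    Note that the above code only works for simple .tar files (archives whose members are regular files and directories); it does not handle links or special files, and it does not restore permissions or timestamps.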

    I submitted [Python.Bugs]: tarfile: handling Windows (path) illegal characters in archive member names.
    I don't know what its outcome is going to be, since I have submitted a couple of other issues (along with fixes for them) that were more serious (in my view), but for various reasons they were rejected.
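
As a possible alternative to copying each member by hand as in code00.py, the sketch below renames the members in memory and then lets extractall() do the actual work, so directories, links and metadata go through the normal tarfile code path. It relies on TarInfo.name being a writable attribute. The function name extract_tar_sanitized and the module-level names WIN_ILLEGAL and TABLE are made up for this illustration and are not part of tarfile. It has not been tried on every Python version, so treat it as a starting point rather than a finished solution.

```python
import tarfile

# Characters Windows rejects in file names (same set as in code00.py).
WIN_ILLEGAL = ':<>|"?*'
TABLE = str.maketrans(WIN_ILLEGAL, "_" * len(WIN_ILLEGAL))


def extract_tar_sanitized(filepath, extractpath="."):
    """Extract a tar archive, replacing Windows-illegal characters in names.

    Hypothetical helper for illustration; returns the sanitized member names.
    """
    with tarfile.open(filepath, "r:*") as tar:
        members = tar.getmembers()
        for member in members:
            # extractall() builds the output path from TarInfo.name, while the
            # file data is still located through the member's stored offset.
            member.name = member.name.translate(TABLE)
            if member.islnk() or member.issym():
                # Keep link targets consistent with the renamed members.
                member.linkname = member.linkname.translate(TABLE)
        kwargs = {}
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters; use the
            # stricter "data" filter when it is available.
            kwargs["filter"] = "data"
        tar.extractall(path=extractpath, members=members, **kwargs)
    return [member.name for member in members]
```

Calling extract_tar_sanitized("arch00.tar", "out") would then write annoying:file_1.txt as out/annoying_file_1.txt. Two caveats: distinct names can collide after replacement (for example a:b and a_b), in which case the later member overwrites the earlier one, and the usual warnings about running extractall() on untrusted archives still apply.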