Unicode issues with tarfile.extractall() (Python 2.7)


I'm using Python 2.7.6 on Windows, and I'm using the tarfile module to extract a gzip-compressed tar archive. The mode option of tarfile.open() is set to "r:gz". After the open call, if I print the contents of the archive via TarFile.list(), I see the following directory in the list:

./静态分析 Part 1.v1/

However, after I call extractall(), that directory doesn't appear among the extracted files. Instead, I see this:

é™æ€åˆ†æž Part 1.v1/

If I were to extract the archive via 7zip, I see a directory with the same name as the first item above. So, clearly, the extractall() method is screwing up, but I don't know how to fix this.


Solution

  • I learned that tar doesn't store encoding information in the archive and treats filenames as raw byte sequences. So the name I saw after tarfile.extractall() was simply the raw byte sequence of the original filename (here, its UTF-8 bytes), displayed in the wrong encoding. To get extractall() to recreate the original filenames, you have to decode each member's name from the archive's encoding into a unicode string before calling extractall(). In my case, the following did the trick:

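      import tarfile

      modeltar = tarfile.open(zippath, mode="r:gz")
      updatedMembers = []
      for m in modeltar.getmembers():
          # Member names are raw byte strings; decode them as UTF-8 so that
          # extractall() creates the files with proper Unicode names.
          m.name = unicode(m.name, 'utf-8')
          updatedMembers.append(m)
      modeltar.extractall(members=updatedMembers, path=dbpath)
      modeltar.close()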
    

    The above code is based on this superuser answer: https://superuser.com/a/190786/354642
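
To see why decoding the names helps, here is a minimal sketch of the byte-level effect, independent of tarfile. It assumes the archive stores names as UTF-8. The helper decode_name is a hypothetical name introduced for illustration, and latin-1 is used only as a stand-in for the Windows code page, which may differ on a real system.

```python
# -*- coding: utf-8 -*-
# Sketch: how UTF-8 filename bytes turn into mojibake, and how decoding
# them with the right encoding recovers the original name.

def decode_name(raw_name, encoding="utf-8"):
    """Hypothetical helper: return raw_name as text.

    Byte strings are decoded with `encoding`; text is returned unchanged.
    """
    if isinstance(raw_name, bytes):
        return raw_name.decode(encoding)
    return raw_name

# The directory name from the question, written with escapes, then encoded
# to the UTF-8 byte sequence that the archive is assumed to contain.
original = u"\u9759\u6001\u5206\u6790 Part 1.v1"
raw = original.encode("utf-8")

# Reading those bytes with a single-byte encoding gives one character per
# byte, which is the garbled name (latin-1 is an assumed stand-in here).
garbled = raw.decode("latin-1")

# Decoding with the encoding that was actually used restores the name.
fixed = decode_name(raw)
```

In Python 2.7 the member names returned by getmembers() are normally byte strings of this kind, so decode_name(m.name) would do the same job as unicode(m.name, 'utf-8') in the loop above.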