I'm using Python 2.7.6 on Windows, and I'm using the tarfile module to extract a gzipped tar archive. The mode option of tarfile.open() is set to "r:gz". After the open call, if I print the contents of the archive via TarFile.list(), I see the following directory in the list:
./静态分析 Part 1.v1/
However, after I call TarFile.extractall(), that directory is not among the extracted files. Instead, I see this:
é™æ€åˆ†æž Part 1.v1/
If I extract the archive with 7-Zip, I see a directory with the same name as the first item above. So, clearly, the extractall() method is screwing up the name, but I don't know how to fix this.
I learned that tar doesn't retain the encoding information as part of the archive and treats filenames as raw byte sequences. So, the output I saw from TarFile.extractall() was simply the raw byte sequence that comprised the file's name prior to compression. To get the extractall() method to recreate the original filenames, I discovered that you have to manually decode the names of the TarFile object's members using the appropriate encoding before calling extractall(). In my case, the following did the trick:
import tarfile

modeltar = tarfile.open(zippath, mode="r:gz")
updatedMembers = []
for m in modeltar.getmembers():
    # Decode the raw UTF-8 bytes into a unicode name (Python 2 only)
    m.name = unicode(m.name, 'utf-8')
    updatedMembers.append(m)
modeltar.extractall(members=updatedMembers, path=dbpath)
The above code is based on this superuser answer: https://superuser.com/a/190786/354642
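For anyone curious where the garbled name comes from, here is a minimal sketch that needs no archive at all. It assumes the names were stored as UTF-8 and that the garbling comes from those bytes being read as Windows-1252. That is my guess at the mechanism, not something I verified inside tarfile. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# The directory name as text (taken from the listing above).
name = u"静态分析 Part 1.v1"

# tar stores the name as raw bytes; assumption: the archive used UTF-8.
raw = name.encode("utf-8")

# Assumption: the bytes get interpreted as Windows-1252 on extraction.
# A few byte values have no mapping in Python's cp1252 codec, so they
# are skipped here with the "ignore" error handler.
garbled = raw.decode("cp1252", "ignore")

# Decoding with the encoding the name was written in recovers it.
fixed = raw.decode("utf-8")

print(repr(garbled))
print(fixed == name)
```

The last step is the same thing the unicode(m.name, 'utf-8') call does for each member before extraction.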