Tags: python, xml, zip, python-docx

zipfile.ZipFile extracts the wrong file


I am working on a project that manipulates a document's XML file. My approach is as follows: first convert the DOCX document into a ZIP archive, then extract the contents of that archive to get access to the document.xml file, and finally convert the XML to a txt file in order to work with it.

I did all of the above on one document and everything worked perfectly. But when I switched to a different document, the zipfile library did not extract the contents of the new ZIP archive. Instead, it somehow extracted the contents of the old document that I had processed before, and document.xml had already been converted into document.txt even though I had not run the block of code that converts the XML into txt.

The worst part is that the old document is not even in the directory anymore, so I have no idea how zipfile is extracting the contents of that particular document when it is not even in the path.

This is the code I am using in a Jupyter notebook.

import os
import shutil
import zipfile

# Convert the DOCX to ZIP
shutil.copyfile('data/docx/input.docx', 'data/zip/document.zip')

# Extract the ZIP
with zipfile.ZipFile('zip/document.zip', 'r') as zip_ref:
    zip_ref.extractall('data/extracted/')

# Convert "document.xml" to txt
os.rename('extracted/word/document.xml', 'extracted/word/document.txt')

# Read the txt file
with open('extracted/word/document.txt') as intxt:
    data = intxt.read()

This is the directory tree for the extracted zip archive for the first document.

data -
     1-docx
     2-zip
     3-extracted/-
                1-customXml/
                2-docProps/
                3-_rels
                4-[Content_Types].xml
                5-word/-document.txt

The second document's directory tree should be as follows:

data -
     1-docx
     2-zip
     3-extracted/-
                1-customXml/
                2-docProps/
                3-_rels
                4-[Content_Types].xml
                5-word/-document.xml

But zipfile is extracting the contents of the first document even when that DOCX file is no longer in the directory. I am also using Ubuntu 20.04, so I am not sure whether this has to do with my OS.


Solution

  • I suspect that you are having issues with relative paths, as unzipping any Word document will create the same file/directory structure. I'd suggest using absolute paths to avoid this. What you may also want to do is, after you are done manipulating and processing the extracted files and directories, delete them. That way you won't encounter any issues with lingering files.
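To illustrate the suggestion above, here is a minimal sketch, not a definitive fix. Note that the snippet in the question copies the file to 'data/zip/document.zip' but then opens 'zip/document.zip', and extracts to 'data/extracted/' but renames files under 'extracted/', so the two halves may be working on different directories depending on the notebook's working directory. The sketch resolves the input to an absolute path and extracts into a temporary directory that is deleted afterwards. The function name read_document_xml is hypothetical, and the code assumes that document.xml is UTF-8 encoded, which is typical for Word output but not guaranteed. A DOCX file is already a ZIP archive, so zipfile can open it directly without copying it to a .zip file first.

```python
import tempfile
import zipfile
from pathlib import Path


def read_document_xml(docx_path):
    """Return the text of word/document.xml from a DOCX file.

    Hypothetical helper for illustration; the name is not from the question.
    """
    # Resolve to an absolute path so the result does not depend on the
    # notebook's current working directory.
    docx_path = Path(docx_path).resolve()

    # Extract into a fresh temporary directory. It is removed when the
    # with-block exits, so files from an earlier document cannot linger.
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(docx_path, "r") as zip_ref:
            zip_ref.extractall(tmp)

        xml_path = Path(tmp) / "word" / "document.xml"
        # Assumption: the XML part is UTF-8 encoded.
        return xml_path.read_text(encoding="utf-8")
```

With this in place, something like data = read_document_xml('data/docx/input.docx') would replace the copy, extract, rename, and read steps. Because nothing is left on disk between runs, a second document cannot pick up files extracted from the first.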