
Iterate over pathlib paths and python-docx: zipfile.BadZipFile


My Python skills are a bit rusty, since I have mostly been using R lately. My goal is to recursively iterate over all .docx files in a directory and change some of their core attributes with the python-docx package, but I ran into the following problem.

For the loop, I first created a list of files with pathlib and glob:

from docx import Document
from docx.shared import Inches
import pathlib

# Root directory that contains the Word files
root_dir = pathlib.Path(r"C:\Users\Björn\PycharmProjects\mre_docx")
# Recursively collect all .docx files below the root directory
files = [x for x in root_dir.glob("**/*.docx") if x.is_file()]
files

The output of files looks fine:

[WindowsPath('C:/Users/Björn/PycharmProjects/mre_docx/test1.docx'),
 WindowsPath('C:/Users/Björn/PycharmProjects/mre_docx/test2.docx')]

When I now try to open a document from the list, I get a zip error (full traceback below):

document = Document(files[1])
Traceback (most recent call last):
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\IPython\core\interactiveshell.py", line 3441, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-26-482c5438fa33>", line 1, in <module>
    document = Document(files[1])
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\api.py", line 25, in Document
    document_part = Package.open(docx).main_document_part
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\package.py", line 128, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\pkgreader.py", line 32, in from_file
    phys_reader = PhysPkgReader(pkg_file)
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\phys_pkg.py", line 101, in __init__
    self._zipf = ZipFile(pkg_file, 'r')
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1257, in __init__
    self._RealGetContents()
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1324, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

However, running the same line with an explicitly written path works fine. The only difference I can see is the path separator (/ versus \), which I thought should not matter because the list contains pathlib.Path objects.

document = Document(pathlib.Path(r"C:\Users\Björn\PycharmProjects\mre_docx\test1.docx"))

Edit in response to a comment

I created a total of four new Word files for this MRE. I entered text in two of them and left the other two empty. To my surprise, the empty ones are the ones that raise the error.

for file in files:
    try:
        document = Document(file)
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        print(f"The file: {file} appears to be corrupted")

Output:

The file: C:\Users\Björn\PycharmProjects\mre_docx\new_file.docx appears to be corrupted
The file: C:\Users\Björn\PycharmProjects\mre_docx\test2.docx appears to be corrupted
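
The pattern in this output can be checked without python-docx, because a .docx file is a zip container. The following is only a minimal sketch: it assumes the failing files are zero-byte placeholders (which the output above does not confirm), and describe_docx is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def describe_docx(path: Path) -> str:
    """Classify a .docx candidate using only the standard library."""
    # A zero-byte file cannot contain a zip directory at all.
    if path.stat().st_size == 0:
        return "empty file (0 bytes)"
    # Non-empty, but still not a zip archive (for example a renamed text file).
    if not zipfile.is_zipfile(path):
        return "not a zip archive"
    # This only checks the container, not that the content is a Word document.
    return "looks like a valid zip container"
```

Calling describe_docx(file) for every entry in files should show which of the two cases applies to each failing file.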

Partial Solution for Future Readers

Wrap the call to Document("Path/to/file.docx") in a try/except block and print the file for which the call failed. In my case there were only a few such files, which I could easily fix manually.
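
Catching every exception can hide unrelated problems, so a slightly narrower variant may be preferable. The sketch below catches only zipfile.BadZipFile, the exception shown in the traceback, and collects the failing paths. The names find_unreadable and opener are hypothetical, and other python-docx versions may raise different exceptions for the same files.

```python
import zipfile
from pathlib import Path
from typing import Callable, Iterable, List


def find_unreadable(paths: Iterable[Path], opener: Callable) -> List[Path]:
    """Return the paths for which opener(path) raises zipfile.BadZipFile."""
    failed = []
    for path in paths:
        try:
            opener(path)
        except zipfile.BadZipFile:
            # Remember the file instead of stopping the whole loop.
            failed.append(path)
    return failed
```

With python-docx installed, this could be called as find_unreadable(files, Document) and the returned list printed or logged.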


Solution

  • You are not doing anything wrong. You get this error because the documents are empty: as the traceback says, such a file is not a zip file, and a zip container is what python-docx expects. If you open those files in Word, type something, and save them, the error goes away. According to the documentation at https://python-docx.readthedocs.io/en/latest/user/documents.html

    you can open Word documents in several ways.

    First, create a new blank document and save it (this overwrites whatever is at that path):

    document = Document()
    document.save(files[1])
    

    Second, open an existing document and save it:

    document = Document(files[1])
    document.save(files[1])
    

    Also, according to the docs, you can open a document from a file-like object:

    with open(files[1], 'rb') as f:
        document = Document(f)
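
For the original goal of changing core attributes on many files, the ideas above could be combined roughly as follows. This is a sketch, not a definitive implementation: it assumes python-docx is installed, iter_valid_docx and set_author are hypothetical helper names, and the zip check only filters out files that are not zip containers, so a zip archive that is not a Word document would still fail when opened.

```python
import zipfile
from pathlib import Path
from typing import Iterator


def iter_valid_docx(root: Path) -> Iterator[Path]:
    """Yield .docx files under root that are at least valid zip archives."""
    for path in sorted(root.rglob("*.docx")):
        if path.is_file() and zipfile.is_zipfile(path):
            yield path


def set_author(root: Path, author: str) -> None:
    """Set the author core property on every readable .docx under root."""
    # Imported here so the filter above can be used without python-docx.
    from docx import Document

    for path in iter_valid_docx(root):
        document = Document(str(path))
        document.core_properties.author = author
        document.save(str(path))
```

A call such as set_author(root_dir, "Some Name") would then skip the empty files instead of raising zipfile.BadZipFile.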