My python skills are a bit rusty since I recently primarily used Rstats. However I ran into the following problem, my goal is that I want to recursively iterate over all .docx
files in a directory and change some of the core attributes with the python-docx
package.
For the loop, I first created a list with pathlib and glob
from docx import Document
from docx.shared import Inches
import pathlib
# Reading the stats dir
root_dir = pathlib.Path(r"C:\some\Björn\PycharmProjects\mre_docx")
# Get all word files in the stats directory
files = [x for x in root_dir.glob("**/*.docx") if x.is_file()]
files
Output of files looks fine.
[WindowsPath('C:/Users/Björn/PycharmProjects/mre_docx/test1.docx'),
WindowsPath('C:/Users/Björn/PycharmProjects/mre_docx/test2.docx')]
When I now want to read in a document with the list I get a zip
error (see full traceback below)
document = Document(files[1])
Traceback (most recent call last):
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\IPython\core\interactiveshell.py", line 3441, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-26-482c5438fa33>", line 1, in <module>
document = Document(files[1])
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\api.py", line 25, in Document
document_part = Package.open(docx).main_document_part
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\package.py", line 128, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\pkgreader.py", line 32, in from_file
phys_reader = PhysPkgReader(pkg_file)
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\phys_pkg.py", line 101, in __init__
self._zipf = ZipFile(pkg_file, 'r')
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1257, in __init__
self._RealGetContents()
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1324, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
However just running the same line of code, without the list works fine (except for differences in the path separator /
and r"\"
, which I thought should not matter due to the fact that the lists contains pathlib.Path objects).
document = Document(pathlib.Path(r"C:\Users\Björn\PycharmProjects\mre_docx\test1.docx"))
I created a total of 4 new word files for this mre. Now I entered text in two of them and two are empty. And to my surprise I found out that the empty ones result in the error.
for file in files:
try:
document = Document(file)
except:
print(f"The file: {file} appears to be corrupted")
Output:
The file: C:\Users\Björn\PycharmProjects\mre_docx\new_file.docx appears to be corrupted
The file: C:\Users\Björn\PycharmProjects\mre_docx\test2.docx appears to be corrupted
Add a try
and except
block around the call to Document("Path/to/file.docx")
, and print out the respective file for which the function failed. In my case it where just a few, which I could easily edit manually.
You are not doing wrong, since documents are empty you are getting this error. If you open those files type something, you will not get any error. But According to https://python-docx.readthedocs.io/en/latest/user/documents.html
You can open word documents with different codes.
First:
document = Document()
document.save(files[1])
Second:
document = Document(files[1])
document.save(files[1])
Also According to docs you can open them like files:
with open(files[1], 'rb') as f:
document = Document(f)