Search code examples
python-3.xzippython-docx

python-docx: Error opening file - "Bad magic number for file header" / "EOFError"


The company I work for distributes document assembly software that uses the python-docx library. The software runs a function on every generated document that opens the document and does a simple search and replace for characters that weren't escaped properly (namely "& amp;" -> "&").

FYI The actual document assembly uses python-docx-template. However, the error happens after the document has already been assembled and the error is triggered by the search-and-replace function, which only uses python-docx.

Recently, we've had a few cases where documents are failing to generate on client deployments. They're throwing an error on this line where the document object is instantiated:

doc = Document(docx=Path(doc_path))

We've seen two errors:

raise BadZipFile("Bad magic number for file header")

and

raise EOFError

The software is widely used and we've never had this issue before. We can't reproduce it in our test environments. The error has only started appearing in the past week but has shown up for several clients after they were updated. The software will fail to generate a particular document some number of times but will succeed after a few tries.

We've only seen it happen with one document in particular, but all documents use the same search and replace function, and like I said the error is only intermittent with the problem document.

There have been no changes in code to this search and replace function and I can't think of any other meaningful difference to our doc assembly process that would explain this.

I'm having a lot of trouble finding info on what could cause this specifically with the python-docx library. Is this a sign that the generated document is corrupted? If anyone is able to shed some light on possible causes that would be very helpful!

Here's the stack trace for both errors:

Bad magic number...

File "/home/user/app/application/document_assembly/core_da.py", line 524, in translate_ampersands
    doc = Document(docx=Path(doc_path))
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/api.py", line 25, in Document
    document_part = Package.open(docx).main_document_part
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/package.py", line 116, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 36, in from_file
    phys_reader, pkg_srels, content_types
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 69, in _load_serialized_parts
    for partname, blob, reltype, srels in part_walker:
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 104, in _walk_phys_parts
    part_srels = PackageReader._srels_for(phys_reader, partname)
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 83, in _srels_for
    rels_xml = phys_reader.rels_xml_for(source_uri)
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/phys_pkg.py", line 129, in rels_xml_for
    rels_xml = self.blob_for(source_uri.rels_uri)
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/phys_pkg.py", line 108, in blob_for
    return self._zipf.read(pack_uri.membername)
  File "/usr/lib/python3.6/zipfile.py", line 1337, in read
    with self.open(name, "r", pwd) as fp:
  File "/usr/lib/python3.6/zipfile.py", line 1396, in open
    raise BadZipFile("Bad magic number for file header")

zipfile.BadZipFile: Bad magic number for file header

EOFError

File "/home/user/app/application/document_assembly/core_da.py", line 524, in translate_ampersands
    doc = Document(docx=Path(doc_path))
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/api.py", line 25, in Document
    document_part = Package.open(docx).main_document_part
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/package.py", line 116, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 36, in from_file
    phys_reader, pkg_srels, content_types
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 69, in _load_serialized_parts
    for partname, blob, reltype, srels in part_walker:
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 110, in _walk_phys_parts
    for partname, blob, reltype, srels in next_walker:
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 105, in _walk_phys_parts
    blob = phys_reader.blob_for(partname)
  File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/phys_pkg.py", line 108, in blob_for
    return self._zipf.read(pack_uri.membername)
  File "/usr/lib/python3.6/zipfile.py", line 1338, in read
    return fp.read()
  File "/usr/lib/python3.6/zipfile.py", line 858, in read
    buf += self._read1(self.MAX_N)
  File "/usr/lib/python3.6/zipfile.py", line 940, in _read1
    data += self._read2(n - len(data))
  File "/usr/lib/python3.6/zipfile.py", line 975, in _read2
    raise EOFError
EOFError

Solution

  • It turns out some code was added recently before this started happening, which effectively sent a duplicate request to the server to generate the document in question. These requests seem to run in parallel - which is surprising because I would predict the conflict to happen much more frequently (same template file being used, generated document writing to the same directory).

    Seems like if the sequence of the requests happened in a particular timing, the "find-and-replace" operation of one request would run into the "save" operation of the other request. So in other words I think one request was trying to open a document that was in the process of being saved.

    So I'm glad it's not something more obscure with the python-docx library, which would have been a lot harder to nail down.