Tags: python, xml, character-encoding, docx

Problems extracting the XML from a Word document in French with Python: illegal characters generated


Over the past few days I have been trying to write a script that would 1) extract the XML from a Word document, 2) modify that XML, and 3) use the new XML to create and save a new Word document. With the help of many Stack Overflow users I eventually found code that looks very promising. Here it is:

import zipfile
import os
import tempfile
import shutil

def getXml(docxFilename):
    zip = zipfile.ZipFile(open(docxFilename,"rb"))
    xmlString= zip.read("word/document.xml").decode("utf-8")
    return xmlString

def createNewDocx(originalDocx,xmlString,newFilename):
    tmpDir = tempfile.mkdtemp()
    zip = zipfile.ZipFile(open(originalDocx,"rb"))
    zip.extractall(tmpDir)
    with open(os.path.join(tmpDir,"word/document.xml"),"w") as f:
        f.write(xmlString)
    filenames = zip.namelist()
    zipCopyFilename = newFilename
    with zipfile.ZipFile(zipCopyFilename,"w") as docx:
        for filename in filenames:
            docx.write(os.path.join(tmpDir,filename),filename)
    shutil.rmtree(tmpDir)

getXml extracts the XML from docxFilename as a string. createNewDocx takes the original Word document and replaces its XML with xmlString, which is a modified version of the original XML, and saves the resulting Word document as newFilename.

To check that the script works as intended, I first created a test document ("test.docx") and ran createNewDocx("test.docx",getXml("test.docx"),"test2.docx"). If everything worked as intended, this was supposed to create an identical copy of test.docx saved as test2.docx. Indeed, that was the case.

I then made the test document more elaborate and experimented with modifying it. And the script still worked!

I then confidently applied my script to the Word document I was actually interested in modifying: template.docx. I ran createNewDocx("template.docx",getXml("template.docx"),"template2.docx"), expecting the script to generate an identical copy of template.docx named template2.docx. Unfortunately, the new Word document could not be opened; apparently there was an illegal character in the XML.

I really don't understand why my code would work for my test document but not for my actual document. I would post template.docx's XML but it contains personal information. One important difference between test.docx and template.docx is that template.docx is written in French, and therefore contains special characters like accents, and also the apostrophes look different. I have no idea if this is what's causing my trouble but I have no other ideas.


Solution

    The problem is that you are accidentally changing the encoding of word/document.xml in template2.docx. In template.docx, word/document.xml is encoded as UTF-8 (the default encoding for XML documents), which is why decoding it as UTF-8 works:

    xmlString = zip.read("word/document.xml").decode("utf-8")
    

    However, when you copy it for template2.docx you are changing the encoding to CP-1252. According to the documentation for open(file, "w"),

    In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.

    You indicated that calling locale.getpreferredencoding(False) gives you cp1252, so that is the encoding in which word/document.xml is being written.

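    Since you did not explicitly add <?xml version="1.0" encoding="cp1252"?> to the beginning of word/document.xml, Word (or any other XML reader) will read it as UTF-8 instead of CP-1252, and that mismatch is what produces the illegal XML character error. It would also explain why test.docx worked, assuming it contained only ASCII text: ASCII characters are encoded identically in CP-1252 and UTF-8, whereas accented letters and typographic apostrophes are not.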
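To make the mismatch concrete, here is a small sketch that does not touch any Word file. The sample strings are made up for illustration; it only shows that French text encoded as CP-1252 is not valid UTF-8, while plain ASCII text comes out the same in both encodings.

```python
# Hypothetical sample text: a typographic apostrophe (U+2019) plus accented letters.
french = "l\u2019été"
ascii_only = "summer"

# This is what open(..., "w") effectively does when the locale encoding is cp1252.
cp1252_bytes = french.encode("cp1252")

# This is what an XML reader does when it assumes the bytes are UTF-8.
try:
    cp1252_bytes.decode("utf-8")
    reads_as_utf8 = True
except UnicodeDecodeError:
    reads_as_utf8 = False

# ASCII-only text produces the same bytes in both encodings.
ascii_matches = ascii_only.encode("cp1252") == ascii_only.encode("utf-8")

print(reads_as_utf8, ascii_matches)  # prints: False True
```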

    So you want to specify the encoding as UTF-8 when writing by using the encoding argument to open():

    with open(os.path.join(tmpDir, "word/document.xml"), "w", encoding="UTF-8") as f:
        f.write(xmlString)
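As an alternative sketch, not the asker's original approach, the temporary directory and text-mode open() can be avoided altogether by copying the archive entry by entry and writing the modified XML as explicitly UTF-8 encoded bytes. The names get_xml and create_new_docx are snake_case stand-ins for the functions in the question, and the sketch assumes the modified XML is still meant to be UTF-8.

```python
import zipfile


def get_xml(docx_filename):
    """Return word/document.xml from a .docx file as a str."""
    with zipfile.ZipFile(docx_filename) as archive:
        return archive.read("word/document.xml").decode("utf-8")


def create_new_docx(original_docx, xml_string, new_filename):
    """Copy original_docx to new_filename, replacing word/document.xml."""
    with zipfile.ZipFile(original_docx) as src, \
            zipfile.ZipFile(new_filename, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename == "word/document.xml":
                # Encode explicitly so the locale default is never involved.
                data = xml_string.encode("utf-8")
            else:
                # Every other entry is copied through unchanged, as bytes.
                data = src.read(item.filename)
            dst.writestr(item, data)
```

With these definitions, create_new_docx("template.docx", get_xml("template.docx"), "template2.docx") performs the same round trip as in the question. Because everything is handled as bytes, the result does not depend on locale.getpreferredencoding().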