Search code examples
pythonxmllxmlopenxmlpython-docx

Why a Word file duplicated with python-docx has different XML than the original


I'm generating .docx files from Word documents that have changed Fonts but maintain the same format, style and information content. All has been done via the python-docx API.

There are still working-arounds not within the scope of the API (such as duplicating numeric bullets from source and inserting non-strings into Headers/Footers). I'm approaching these via lxml.

The XML of the original file and the generated file, while similar, are not identical irrespective of the lack of <w:numPr> tags. Why is that? The output .docx files look as expected.

This complicates working with low level lxml fixes.


Solution

  • Your assumption that there would be only a single way to represent a document in as complicated of a format as OOXML, especially generated from independently written code bases, is very much invalid.

    So the answer to your question is that multiple OOXML representations can yield the same appearance in Microsoft Word (or any other DOCX application); it is not safe to assume that any given library will write any given OOXML exactly the same as any given DOCX application would.