node.js, xml, docx

Understanding MS Word XML - Troubleshoot Corrupt Document


I'm building an application that uses the Node.js library docx to "patch" an MS Word document and send it to the client. In a prior revision, I had the docx library output a buffer and then used LibreOffice from the command line to convert the document to a PDF. This worked flawlessly every time.

My client decided they would rather just have the application output the MS Word document (docx) so they could make minor modifications as needed. I modified the code to download the MS Word document, but Word sees the document as corrupt each time I try to open it.

Trying to figure out why, I opened the docx with 7-Zip and examined the document.xml file inside. Everything seemed fine, so I began commenting out parts of the XML to narrow down the issue.

There are tables in the document, and what I noticed is that MS Word doesn't seem to like a paragraph (w:p) inside a table cell (w:tc). The document opens fine when the paragraph below is commented out, but when I uncomment it, I get the standard "Word experienced an error trying to open the file. Try these suggestions...".

<w:tc>
  <w:tcPr>
    <w:tcW w:w="3116" w:type="dxa"/>
  </w:tcPr>

  <!-- commented code here
    <w:p>
      <w:r>
        <w:t>AAA</w:t>
      </w:r>
    </w:p> 
  -->

</w:tc>

Anyone able to explain what might be happening here? According to this documentation, it should be working. Could I be looking in the wrong area?

EDIT - I thought I should note that the document opens up fine in Google Docs. It does not open with MS Office 365 (both Desktop and in MS Teams).


Solution

  • I figured out what my issue was after exploring some OOXML validation tools.

    What I learned was that my table <w:tbl> was a child of a paragraph <w:p>, which is not allowed according to the schema. Removing the parent <w:p>...</w:p> tags resolved my issue.
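For anyone hitting the same error, here is a rough sketch of how the check could be automated in Node.js. It assumes document.xml has already been extracted from the docx (for example with 7-Zip, as in the question) and loaded into a string. The function name `findTablesInsideParagraphs` is my own and not part of the docx library, and the regex-based tag scan is a simplification that ignores CDATA sections and unescaped `>` characters in attribute values. A proper OOXML validator will catch more than this does.

```javascript
// Sketch only: reports the offset of every <w:tbl> whose direct parent is <w:p>.
// findTablesInsideParagraphs is a hypothetical helper, not a docx library API.
function findTablesInsideParagraphs(xml) {
  // Blank out comments (keeping the length) so commented-out markup is
  // ignored and the reported offsets still point into the original string.
  const source = xml.replace(/<!--[\s\S]*?-->/g, (m) => " ".repeat(m.length));

  // Matches opening, closing and self-closing element tags.
  const tagPattern = /<(\/?)([A-Za-z_][\w:.-]*)[^>]*?(\/?)>/g;
  const stack = [];      // names of the currently open elements
  const offenders = [];  // offsets of misplaced <w:tbl> tags

  let match;
  while ((match = tagPattern.exec(source)) !== null) {
    const [, closing, name, selfClosing] = match;
    if (closing) {
      stack.pop();
      continue;
    }
    // A table directly inside a paragraph is what broke the document here.
    if (name === "w:tbl" && stack[stack.length - 1] === "w:p") {
      offenders.push(match.index);
    }
    if (!selfClosing) {
      stack.push(name);
    }
  }
  return offenders;
}

// Usage: the first string wraps the table in a paragraph, the second does not.
const broken =
  "<w:body><w:p><w:tbl><w:tr><w:tc><w:p><w:r><w:t>AAA</w:t></w:r></w:p>" +
  "</w:tc></w:tr></w:tbl></w:p></w:body>";
const fixed =
  "<w:body><w:tbl><w:tr><w:tc><w:p><w:r><w:t>AAA</w:t></w:r></w:p>" +
  "</w:tc></w:tr></w:tbl></w:body>";

console.log(findTablesInsideParagraphs(broken).length); // 1
console.log(findTablesInsideParagraphs(fixed).length);  // 0
```

Note that the paragraph inside the table cell is not flagged. A w:p inside a w:tc is fine; only the w:tbl sitting directly inside a w:p is the problem.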
