
How to read docx originated from Word templates with python-docx?


I'm extracting all the text from a docx file using the python-docx library. The simplified code is as follows:

from docx import Document

def read_element(doc):
    for p in doc.paragraphs:
        print('paragraph text:', p.text)
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                read_element(cell)

doc = Document("<path to file>")

read_element(doc)

This works well in many cases, except when the file was created from a Microsoft Word template. In those cases it only reads the text I typed into the file, not the text that came with the template.

To reproduce:

  • Create Microsoft Word document via Create from template
  • Write one word in it, e.g. "testing"
  • Save it
  • Put its path in the code above
  • Run code

Output:

paragraph text:  testing
paragraph text: To learn more and get OneNote, visit .

However, the file contains more text than the output shows:

Take Notes testing

  • To take notes, just tap here and start typing.
  • Or, easily create a digital notebook for all your notes that automatically syncs across your devices, using the free OneNote app.

To learn more and get OneNote, visit www.onenote.com.

The same text is visible in a screenshot of the docx file we are trying to read.

Any ideas on how to retrieve the missing text?


Solution

  • python-docx will only find paragraphs and tables at the top level of the document. In particular, paragraphs or tables "wrapped" in a "container" element will not be detected.

    Most commonly, the "container" is a pending (not yet accepted) revision, which produces this same behavior.

    To extract the "wrapped" text, you'll need to know what the "wrapper" elements are. One way to do that is by dumping the XML of the document body:

    from docx import Document

    document = Document("my-document.docx")
    print(document._body._body.xml)
    

    A paragraph element has a w:p tag. Inspect the output for those; I expect some of them will be nested inside another element.
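If the XML dump is too long to scan by eye, the wrapper elements can also be tallied programmatically. The following is a minimal sketch that uses only the standard library rather than python-docx; the function name `paragraph_parent_tags` is hypothetical, and it assumes you pass it the XML of the main document part as a string or bytes.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML namespace that the "w:" prefix stands for.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraph_parent_tags(document_xml):
    """Count, per parent tag, how many w:p elements sit directly under it."""
    root = ET.fromstring(document_xml)
    counts = Counter()
    for parent in root.iter():
        for child in parent:
            if child.tag == W + "p":
                # Show the tag with the familiar "w:" prefix instead of the full URI.
                counts[parent.tag.replace(W, "w:")] += 1
    return counts
```

Any key other than `w:body` or `w:tc` (a table cell) in the result is a candidate "wrapper" whose paragraphs `doc.paragraphs` would not reach.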

    Then you can extract those elements with an XPath expression. Something like this would work if the "wrapper" element were <w:x>:

    from docx.text.paragraph import Paragraph
    
    body = document._body._body
    ps_under_xs = body.xpath("w:x//w:p")
    for p in ps_under_xs:
        paragraph = Paragraph(p, None)
        print(paragraph.text)
    

    You could also just get all the <w:p> elements in the document, without regard to their "parentage" with something like this:

    ps = body.xpath(".//w:p")
    

    The drawback of this is that some containers (like unaccepted revision marks) can contain text that has been "deleted" from the document, so you might get more than what you wanted.

    In any case, this general approach should work for the job you've described. If you need something more sophisticated, searching online will turn up plenty of material on XPath expressions.
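As a rough illustration of the "get every <w:p> regardless of parentage" idea, here is a standard-library sketch that does not use python-docx at all. It assumes the main document part is stored at `word/document.xml` inside the .docx zip package (the usual location for files saved by Word, but not guaranteed), and the function names and the sample XML are hypothetical. The sample wraps one paragraph in a `w:sdt` content control, which is one kind of container that template placeholder text may live in.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace that the "w:" prefix stands for.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraph_texts_from_xml(document_xml):
    """Return the text of every w:p element, at any nesting depth."""
    root = ET.fromstring(document_xml)
    texts = []
    for p in root.iter(W + "p"):
        # Only w:t elements are joined; text stored in other elements
        # (such as w:delText) is not collected by this sketch.
        texts.append("".join(t.text or "" for t in p.iter(W + "t")))
    return texts


def paragraph_texts_from_docx(path):
    """Assumes the main part is word/document.xml inside the zip package."""
    with zipfile.ZipFile(path) as package:
        return paragraph_texts_from_xml(package.read("word/document.xml"))


# Hypothetical sample: one top-level paragraph, one inside a content control.
SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:t>testing</w:t></w:r></w:p>"
    "<w:sdt><w:sdtContent>"
    "<w:p><w:r><w:t>Take </w:t></w:r><w:r><w:t>Notes</w:t></w:r></w:p>"
    "</w:sdtContent></w:sdt>"
    "</w:body></w:document>"
)

print(paragraph_texts_from_xml(SAMPLE))  # ['testing', 'Take Notes']
```

Because this walks every `w:p` descendant, a paragraph nested inside another paragraph (for example in a text box) would have its text reported twice, so treat the result as a starting point rather than a faithful rendering of the document.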