Search code examples
pythonpython-docx

Using python-docx to to read .docx, preserving special characters, bullets


I am trying to batch manipulate many microsoft word documents in .docx format within python.

The following code accomplishes what I need, except it loses special characters I would like to preserve, like the right arrow symbol and bullets.

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return fullText

getText('example.docx')

Solution

  • The Paragraph.text property in python-pptx returns the plain text in the paragraph as a string. This is a very common requirement.

    Bullets, or numbered lists in general (of which bullets are a type) are not reflected in the text of a paragraph, even though it may appear that way on-screen. This sort of thing will be an additional property of the paragraph.

    One way bullets can be applied is using the 'List Bullet' style. The paragraph style is available on Paragraph.style.

    The documentation here is your friend for this and other details, in particular the 11 topics in the User Guide section:
    http://python-docx.readthedocs.io/en/latest/