
python-docx style error with 'List Paragraph'


I am using python-docx to convert Word docx files into a proprietary XML format.

I'm having trouble with bulleted/numbered lists. In a number of Word documents, when I open them with python-docx and look at the paragraph style of the list items, some of them will be 'List Paragraph' but many of them will be 'Normal'.

Assuming they should all be 'List Paragraph', is there a way I can verify if this is an issue with the Word document or with the python-docx package?

Also, is there a way to identify these bullets/numbers when the paragraph style isn't what it should be? E.g., using paragraph_format?


Solution

  • A bullet point can appear on a paragraph in Word in at least two different ways:

    1. The user applies a paragraph style, like "List Paragraph"
    2. The user applies a bullet directly to the paragraph, probably using the bullet button on the toolbar.

    Users tend to fall into one of these two habits. Using styles consistently allows you to adjust the formatting of all those paragraphs just by modifying the style, but I suspect 98%+ of users cultivate the "click the bullet button" habit.

    In any case, it's not surprising to find a document that's a mixed bag that way.

    Unfortunately, python-docx doesn't currently have support for directly-applied bullets, either for applying them or detecting them.

    If you're comfortable inspecting the XML of the paragraph (print(paragraph._p.xml) is a start), then you can use an XPath expression on paragraph._p (the XML element underlying the paragraph) to detect whether it has a <w:numPr> element inside its <w:pPr> (paragraph properties) element, which would indicate it has a directly-applied bullet or number. Inspecting the XML of a paragraph known to have a directly applied bullet should confirm the details of what you'd be looking for there.
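
As a rough illustration of that check, here is a minimal sketch that looks for a <w:numPr> element under <w:pPr> in a paragraph's XML. It uses only the standard library and takes the XML as a string, such as the one paragraph._p.xml gives you. The names has_direct_numbering, W_NS and SAMPLE are made up for this example and are not part of python-docx, and the sample XML is hand-written rather than taken from a real document.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def has_direct_numbering(p_xml: str) -> bool:
    """Return True if the <w:p> XML carries its own <w:pPr>/<w:numPr>.

    Hypothetical helper: p_xml is the serialized paragraph element,
    e.g. the string you get from paragraph._p.xml in python-docx.
    """
    p = ET.fromstring(p_xml)
    # Look only at this paragraph's own properties, not at its style.
    return p.find(f"{{{W_NS}}}pPr/{{{W_NS}}}numPr") is not None


# Hand-written sample of a paragraph with a directly-applied bullet/number.
SAMPLE = (
    f'<w:p xmlns:w="{W_NS}">'
    '<w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr>'
    "<w:r><w:t>First item</w:t></w:r>"
    "</w:p>"
)

print(has_direct_numbering(SAMPLE))  # True
```

With a real document you would call has_direct_numbering(paragraph._p.xml) for each paragraph in document.paragraphs. This only detects numbering applied directly to the paragraph; a paragraph that gets its bullet from its style will not have its own <w:numPr>, so you would still check paragraph.style.name as well.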