I am using python-docx to convert Word docx files into a proprietary XML format.
I'm having trouble with bullet/enumerated lists. In a number of Word documents when I open them with python-docx and look at the paragraph style of the bullet/enumerated lists, some of the items in the list will be 'List Paragraph' but many of them will be 'Normal'.
Assuming they should all be 'List Paragraph', is there a way I can verify if this is an issue with the Word document or with the python-docx package?
Also, is there a way to identify these bullets/numbers when the paragraph style isn't what it should be?
E.g., using paragraph_format?
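A bullet point can appear on a paragraph in Word in at least two different ways:
1. By applying a paragraph style that includes bullet formatting, such as 'List Bullet'.
2. By applying the bullet directly to the paragraph, e.g. by clicking the bullet button on the toolbar.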
I suspect users tend to fall into one of these two habits. Using styles consistently allows you to adjust the formatting of all those paragraphs just by modifying the style. But I suspect 98%+ of users cultivate the "click the bullet button" habit.
In any case, it's not surprising to find a document that's a mixed bag that way.
Unfortunately, python-docx doesn't currently have support for directly applied bullets, either for applying them or detecting them.
If you're comfortable inspecting the XML of the paragraph (print(paragraph._p.xml) is a start), you can use an XPath expression on paragraph._p (the XML element underlying the paragraph) to detect whether it has a <w:numPr> element inside its <w:pPr> paragraph properties, which would indicate a directly applied bullet or number. Inspecting the XML of a paragraph known to have a directly applied bullet should give you the details of what you'd be looking for there.
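As a rough illustration of that check, here is a minimal sketch that works on the XML string of a single paragraph (the kind of text paragraph._p.xml prints). It uses only the standard library so it can be tried without a document at hand. The function name direct_list_info and the two sample XML strings are hypothetical, made up for this example; only the w:pPr, w:numPr, w:numId and w:ilvl element names and the WordprocessingML namespace come from the file format itself.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}
W_VAL = "{%s}val" % W_NS  # fully qualified name of the w:val attribute


def direct_list_info(p_xml):
    """Return (numId, ilvl) if the paragraph has a directly applied
    w:numPr element, otherwise None.

    p_xml is the XML text of one <w:p> element, including its
    namespace declarations. Hypothetical helper, not part of python-docx.
    """
    p = ET.fromstring(p_xml)
    num_pr = p.find("w:pPr/w:numPr", NS)
    if num_pr is None:
        return None
    num_id = num_pr.find("w:numId", NS)
    ilvl = num_pr.find("w:ilvl", NS)
    return (
        num_id.get(W_VAL) if num_id is not None else None,
        ilvl.get(W_VAL) if ilvl is not None else None,
    )


# Made-up sample paragraphs: one with a directly applied list, one without.
BULLETED = (
    '<w:p xmlns:w="%s"><w:pPr><w:numPr><w:ilvl w:val="0"/>'
    '<w:numId w:val="1"/></w:numPr></w:pPr>'
    '<w:r><w:t>Item</w:t></w:r></w:p>' % W_NS
)
PLAIN = '<w:p xmlns:w="%s"><w:r><w:t>Text</w:t></w:r></w:p>' % W_NS

print(direct_list_info(BULLETED))
print(direct_list_info(PLAIN))
```

With python-docx, the same function could presumably be fed paragraph._p.xml for each paragraph in document.paragraphs. Note that this only tells you a numbering definition is attached to the paragraph; whether it renders as a bullet or a number depends on the definition the numId points to in the numbering part, and a numId of 0 is used to switch numbering off.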