How can I replace whole XML elements in a Word .docx in Python as if they were strings?

My word document contains several symbol font lines that are not being recorded as text. When I use python-docx to view the underlying xml I can see the lines not being printed look like this:

<w:sym w:font="Symbol" w:char="F0B3"/>

but python-docx ignores w:sym tags entirely when I extract text, as if they weren't there at all. That means I can't just find and replace the symbols with the correct characters. I need to be able to replace them before extracting the tables and text from my documents.

How can I turn tree elements like the one above into their w:t equivalents, like this:

<w:t>≥</w:t>

I'm totally fine setting up a dictionary for full line replacements. I just can't work out how to do it without breaking the xml file.

Solution

This is not supported by the python-docx API. You'll need to edit the XML in another way.

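python-docx can give you access to the paragraph XML element (<w:p>) as an lxml.etree._Element object (paragraph._p), and you can then use the lxml API to manipulate the elements beneath it. Note that a w:sym element normally sits inside a run (<w:r>) rather than directly under the paragraph, so you need to search the paragraph's descendants, not just its immediate children. The basic idea is to insert a new <w:t> element wherever you find a w:sym element and then remove the w:sym element.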

The lxml.etree._Element API docs are here: https://lxml.de/api/lxml.etree._Element-class.html. The code might look something like this:

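from docx.oxml import OxmlElement
from docx.oxml.ns import qn

SYMBOLS = {"F0B3": "≥"}  # w:char value -> replacement text; extend as needed
p = paragraph._p
# w:sym lives inside a run (w:r) and lxml tags are namespace-qualified, hence iter() and qn()
for sym in list(p.iter(qn("w:sym"))):
    text = SYMBOLS.get(sym.get(qn("w:char")))
    if text is None:
        continue  # leave unmapped symbols alone
    new_t_element = OxmlElement("w:t")
    new_t_element.text = text
    sym.addprevious(new_t_element)
    sym.getparent().remove(sym)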

There are still details of this to work out, but hopefully this gives you a direction to pursue. Perhaps you can post your solution here once you've resolved the details.
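
As one possible way to fill in those details, here is a minimal, self-contained sketch of the replace-in-place idea on a stand-alone XML fragment. It deliberately uses only the standard library's xml.etree.ElementTree rather than python-docx, so it runs without a document. The calls it relies on (iter, get, makeelement, item assignment) also exist on lxml elements, but I have not run it against a real paragraph._p. The names replace_symbols, SYMBOL_MAP and FRAGMENT are made up for illustration, and the mapping holds only the one symbol from the question.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"  # Clark-notation prefix used in element and attribute names

# Hypothetical lookup table: (w:font, w:char) -> replacement text.
SYMBOL_MAP = {("Symbol", "F0B3"): "≥"}


def replace_symbols(root, symbol_map=SYMBOL_MAP):
    """Swap each mapped w:sym under root for a w:t; return the number replaced."""
    replaced = 0
    # Take a snapshot of the runs so the tree is not modified while iterating it.
    for run in list(root.iter(W + "r")):
        for index, child in enumerate(list(run)):
            if child.tag != W + "sym":
                continue
            key = (child.get(W + "font"), child.get(W + "char"))
            if key not in symbol_map:
                continue  # leave unknown symbols untouched
            new_t = run.makeelement(W + "t", {})
            new_t.text = symbol_map[key]
            run[index] = new_t  # same position, so the run's child order is kept
            replaced += 1
    return replaced


# A made-up paragraph: "x ", then the symbol from the question, then " 5".
FRAGMENT = (
    '<w:p xmlns:w="' + W_NS + '">'
    '<w:r><w:t xml:space="preserve">x </w:t></w:r>'
    '<w:r><w:sym w:font="Symbol" w:char="F0B3"/></w:r>'
    '<w:r><w:t xml:space="preserve"> 5</w:t></w:r>'
    "</w:p>"
)

paragraph_xml = ET.fromstring(FRAGMENT)
count = replace_symbols(paragraph_xml)
text = "".join(t.text or "" for t in paragraph_xml.iter(W + "t"))
print(count, text)
```

Keying the dictionary on both w:font and w:char is a design choice: the same character code can mean different glyphs in Symbol and, say, Wingdings. With python-docx, the same function could presumably be called on document.element right after opening the file, so that symbols inside tables are converted too before any text is read. Treat that as an untested assumption.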