Search code examples
pythonxmlsymbolspython-docx

How can I replace whole xml elements in a word docx in python as if they were strings


My word document contains several symbol font lines that are not being recorded as text. When I use python-docx to view the underlying xml I can see the lines not being printed look like this:

<w:sym w:font="Symbol" w:char="F0B3"/>

but python-docx totally ignores w:sym tags. As if they weren't there at all when I'm extracting text. That means I can't just find and replace the symbols with the correct format. I need to be able to replace them before extracting the tables and text from my documents.

How can I turn the above tree elements into this the w:t versions like this:

<w:t>≥</w:t>

I'm totally fine setting up a dictionary for full line replacements. I just can't work out how to do it without breaking the xml file.


Solution

  • This is not supported by the python-docx API. You'll need to edit the XML in another way.

    python-docx can give you access to the paragraph XML element (<w:p>) in the form of an lxml.etree._Element object and then you can use that API to manipulate its children. The basic idea would be to insert a new <w:t> element wherever you find a w:sym element and then remove the w:sym element.

    The lxml.etree._Element API docs are here: https://lxml.de/api/lxml.etree._Element-class.html. The code might look something like this:

    p = paragraph._p
    for child_element in list(p):
        if child_element.tag != "w:sym":
            continue
        new_t_element = ...
        child_element.addprevious(new_t_element)
        p.remove(child_element)
    

    There are still details of this to work out, but hopefully this gives you a direction to pursue. Perhaps you can post your solution here once you've resolved the details.