My Word document contains several lines in a symbol font that are not being extracted as text. When I use python-docx to view the underlying XML, I can see that the missing lines look like this:
<w:sym w:font="Symbol" w:char="F0B3"/>
but python-docx ignores w:sym tags entirely when I'm extracting text, as if they weren't there at all. That means I can't just find and replace the symbols with the correct format. I need to be able to replace them before extracting the tables and text from my documents.
How can I turn tree elements like the one above into w:t versions like this:
<w:t>≥</w:t>
I'm totally fine setting up a dictionary for full line replacements. I just can't work out how to do it without breaking the XML file.
This is not supported by the python-docx API. You'll need to edit the XML in another way.

python-docx can give you access to the paragraph XML element (<w:p>) in the form of an lxml.etree._Element object and then you can use that API to manipulate its children. The basic idea would be to insert a new <w:t> element wherever you find a w:sym element and then remove the w:sym element.

The lxml.etree._Element API docs are here: https://lxml.de/api/lxml.etree._Element-class.html. The code might look something like this:
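from docx.oxml.ns import qn

p = paragraph._p
# lxml stores tags in namespace-qualified form, so compare against qn("w:sym")
# rather than the literal string "w:sym". A w:sym normally sits inside a run
# (<w:r>), so search all descendants of the paragraph, not just its children.
for sym_element in list(p.iter(qn("w:sym"))):
    new_t_element = ...  # placeholder: build the replacement <w:t> element here
    sym_element.addprevious(new_t_element)
    # remove from the actual parent (the run), which is not necessarily p
    sym_element.getparent().remove(sym_element)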
There are still details of this to work out, but hopefully this gives you a direction to pursue. Perhaps you can post your solution here once you've resolved the details.
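One way the remaining details might be filled in is sketched below. It is only a sketch under stated assumptions: `replace_symbols`, `SYMBOL_MAP`, and the sample paragraph are hypothetical names made up for illustration, and the only mapping taken from the question is Symbol/F0B3 to ≥; any other entries would need to be checked against the fonts actually used in the document. The function relies only on lxml element methods (`iter`, `makeelement`, `addprevious`, `getparent`), so it does not need anything from the python-docx API itself.

```python
from lxml import etree

# WordprocessingML namespace, written out so tags can be given in the
# "{namespace}local" form that lxml uses internally.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
SYM_TAG = f"{{{W_NS}}}sym"
T_TAG = f"{{{W_NS}}}t"

# Hypothetical lookup table: (w:font, w:char) -> replacement text.
# Only this entry comes from the question; extend it for your documents.
SYMBOL_MAP = {
    ("Symbol", "F0B3"): "≥",
}


def replace_symbols(root, symbol_map=SYMBOL_MAP):
    """Swap each mapped <w:sym> under root for a <w:t>; return how many were replaced."""
    replaced = 0
    # list() so the tree can be modified safely while looping
    for sym in list(root.iter(SYM_TAG)):
        font = sym.get(f"{{{W_NS}}}font")
        char = (sym.get(f"{{{W_NS}}}char") or "").upper()
        text = symbol_map.get((font, char))
        if text is None:
            continue  # unmapped symbol: leave the w:sym untouched
        new_t = sym.makeelement(T_TAG, nsmap={"w": W_NS})
        new_t.text = text
        sym.addprevious(new_t)        # put <w:t> where the symbol was
        sym.getparent().remove(sym)   # then drop the <w:sym>
        replaced += 1
    return replaced


# Small stand-in for a paragraph element, so the sketch runs without a .docx file.
sample = etree.fromstring(
    f'<w:p xmlns:w="{W_NS}">'
    '<w:r><w:t>x </w:t></w:r>'
    '<w:r><w:sym w:font="Symbol" w:char="F0B3"/></w:r>'
    '<w:r><w:t> 5</w:t></w:r>'
    '</w:p>'
)
count = replace_symbols(sample)
text = "".join(t.text for t in sample.iter(T_TAG))  # text is now "x ≥ 5"
```

With python-docx, the same function could presumably be called once on `document.element.body` (or on each `paragraph._p`) before reading any text or tables, since table cells live under the body element too. Headers and footers are stored in separate parts, so they would need their own call if they matter.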