
Removing an element, but not the text after it


I have an XML file similar to this:

<root>
<a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
</root>

I want to remove all text in <b> or <u> elements (and descendants), and print the rest. This is what I tried:

from __future__ import print_function
import xml.etree.ElementTree as ET

tree = ET.parse('a.xml')
root = tree.getroot()

parent_map = {c:p for p in root.iter() for c in p}

for item in root.findall('.//b'):
  parent_map[item].remove(item)
for item in root.findall('.//u'):
  parent_map[item].remove(item)
print(''.join(root.itertext()).strip())

(I used the recipe in this answer to build the parent_map.) The problem, of course, is that remove(item) also removes the text that follows the element (its tail), so the result is:

Some that I

whereas what I want is:

Some  text that I  want to keep.

Is there any solution?


Solution

  • If nothing better turns up, you can use clear() instead of remove(), saving the element's tail beforehand and restoring it afterwards (clear() resets the tail along with the text, attributes, and children):

    import xml.etree.ElementTree as ET
    
    
    data = """<root>
    <a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
    </root>"""
    
    tree = ET.fromstring(data)
    a = tree.find('a')
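    for element in a:
        if element.tag in ('b', 'u'):
            # clear() also wipes the tail, so save it and put it back
            tail = element.tail
            element.clear()
            element.tail = tail
    
    # tostring() returns bytes on Python 3, so decode before printing
    print(ET.tostring(tree).decode())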
    

    prints (see empty b and u tags):

    <root>
    <a>Some <b /> text <i>that</i> I <u /> want to keep.</a>
    </root>
    

    Also, here's a solution using xml.dom.minidom, which removes the elements entirely:

    import xml.dom.minidom
    
    data = """<root>
    <a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
    </root>"""
    
    dom = xml.dom.minidom.parseString(data)
    a = dom.getElementsByTagName('a')[0]
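    # Iterate over a copy: childNodes is live and changes as nodes are removed
    for child in list(a.childNodes):
        # Text nodes have no tagName, hence the getattr default
        if getattr(child, 'tagName', '') in ('u', 'b'):
            a.removeChild(child)
    
    print(dom.toxml())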
    

    prints:

    <?xml version="1.0" ?><root>
    <a>Some  text <i>that</i> I  want to keep.</a>
    </root>
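
If you would rather stay with ElementTree and drop the elements completely, one possible approach is to move each removed element's tail onto whatever precedes it (the previous sibling's tail, or the parent's text if there is no previous sibling) before calling remove(). The sketch below is not part of the original answer; the helper name remove_keeping_tail is made up for illustration. Unlike the loops above, it also handles b and u elements nested at any depth.

```python
import xml.etree.ElementTree as ET

data = """<root>
<a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
</root>"""


def remove_keeping_tail(root, tags):
    """Remove every element whose tag is in `tags`, but keep its tail text.

    Hypothetical helper: the tail of a removed element is appended to the
    previous sibling's tail, or to the parent's text if there is none.
    """
    # Snapshot the elements first so removals do not disturb the iteration
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag in tags:
                tail = child.tail or ''
                if previous is None:
                    parent.text = (parent.text or '') + tail
                else:
                    previous.tail = (previous.tail or '') + tail
                parent.remove(child)
            else:
                previous = child


root = ET.fromstring(data)
remove_keeping_tail(root, {'b', 'u'})
result = ''.join(root.itertext()).strip()
print(result)
```

With the sample document this prints the text the question asks for, with the b and u elements gone and the surrounding text intact. No parent_map is needed, because each parent is visited directly.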