Tags: python, xml, lxml

Remove all nested XML tags


I am trying to remove all "nested tags of the same type". For every XML element, if another element in its subtree has the same name, remove that element's tag (but keep its contents). In other words, transform <a>...<a>...</a>...</a> into <a>.........</a>.

I created a very nice and simple piece of code using functions iter and strip_tags from the lxml package:

import lxml.etree

root = lxml.etree.parse('book.txt')

for element in root.iter():
    lxml.etree.strip_tags(element, element.tag)

print(lxml.etree.tostring(root).decode())

I used this input file:

<book>
    <b><title>My <b>First</b> Book</title></b>
    <i>Introduction <i><i>To</i></i> LXML</i>
    <name><a>Author: <a>James</a></a></name>
</book>

and I got this output:

<book>
    <b><title>My First Book</title></b>
    <i>Introduction To LXML</i>
    <name><a>Author: <a>James</a></a></name>
</book>

As you can see, it removed almost all the nested tags except one: <a>Author: <a>James</a></a>. What is wrong with the code? How can I fix it?


Solution

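  • It is not safe to modify the XML tree while iterating over it: root.iter() walks the tree lazily, so stripping tags inside the loop can remove elements the iterator has yet to visit, and the traversal may then skip later elements such as the nested <a>. Instead, build a list of all elements first and iterate over that.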

    import lxml.etree
    
    root = lxml.etree.parse('book.txt')
    
    all_elements = list(root.iter())
    
    for element in all_elements:
        lxml.etree.strip_tags(element, element.tag)
        
    print(lxml.etree.tostring(root).decode())
    

    Output:

    <book>
        <b><title>My First Book</title></b>
        <i>Introduction To LXML</i>
        <name><a>Author: James</a></name>
    </book>
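
For anyone who wants to try this without creating book.txt, here is a minimal self-contained sketch of the same fix. It parses the sample input from a string instead of a file; the constant XML and the helper name flatten_nested_tags are made up for illustration and are not part of lxml.

```python
import lxml.etree

# The sample input from the question, inlined so no file is needed.
XML = """<book>
    <b><title>My <b>First</b> Book</title></b>
    <i>Introduction <i><i>To</i></i> LXML</i>
    <name><a>Author: <a>James</a></a></name>
</book>"""


def flatten_nested_tags(root):
    """Hypothetical helper: strip every descendant that repeats an ancestor's tag."""
    # Snapshot all elements up front so the loop does not depend on a
    # live iterator while the tree is being modified.
    for element in list(root.iter()):
        # strip_tags removes matching descendants of `element` but keeps
        # their text and children; `element` itself stays in place.
        lxml.etree.strip_tags(element, element.tag)
    return root


root = lxml.etree.fromstring(XML)
result = lxml.etree.tostring(flatten_nested_tags(root)).decode()
print(result)
```

The only change that matters compared with the code in the question is the list(...) around root.iter(); everything else is packaging.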