I am trying to remove all nested tags of the same type. For every XML element, if any element in its subtree has the same name, remove that descendant's tag but keep its contents. In other words, transform <a>...<a>...</a>...</a> into <a>.........</a>.
I wrote a nice, simple piece of code using the iter and strip_tags functions from the lxml package:
import lxml.etree
root = lxml.etree.parse('book.txt')
for element in root.iter():
    lxml.etree.strip_tags(element, element.tag)
print(lxml.etree.tostring(root).decode())
I used this input file:
<book>
<b><title>My <b>First</b> Book</title></b>
<i>Introduction <i><i>To</i></i> LXML</i>
<name><a>Author: <a>James</a></a></name>
</book>
and I got this output:
<book>
<b><title>My First Book</title></b>
<i>Introduction To LXML</i>
<name><a>Author: <a>James</a></a></name>
</book>
As you can see, it removed almost all the nested tags except one: <a>Author: <a>James</a></a>. What is wrong with the code? How can I fix it?
It is not safe to modify the XML tree while iterating over it. Instead, take a snapshot of all elements with list(root.iter()) before changing anything, and iterate over that list.
import lxml.etree
root = lxml.etree.parse('book.txt')
# Snapshot the elements first, so strip_tags cannot disturb the iteration.
all_elements = list(root.iter())
for element in all_elements:
    lxml.etree.strip_tags(element, element.tag)
print(lxml.etree.tostring(root).decode())
Output:
<book>
<b><title>My First Book</title></b>
<i>Introduction To LXML</i>
<name><a>Author: James</a></name>
</book>
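To try the same fix without a book.txt file, here is a minimal sketch that parses the XML from a string instead. The function name flatten_nested_same_tag and the inline sample document are made up for illustration; only iter, strip_tags, fromstring and tostring come from lxml.

```python
import lxml.etree


def flatten_nested_same_tag(root):
    """Strip every descendant tag that repeats the name of an ancestor.

    Hypothetical helper name; it wraps the snapshot-then-strip loop
    from the answer above.
    """
    # list(...) materialises the iteration before the tree is modified.
    for element in list(root.iter()):
        # strip_tags never removes `element` itself, only matching
        # descendants; their text and children are merged upwards.
        lxml.etree.strip_tags(element, element.tag)
    return root


# Inline sample (an assumption, shortened from the question's input).
xml = (
    "<book>"
    "<name><a>Author: <a>James</a></a></name>"
    "<i>Introduction <i><i>To</i></i> LXML</i>"
    "</book>"
)
root = lxml.etree.fromstring(xml)
flatten_nested_same_tag(root)
result = lxml.etree.tostring(root).decode()
print(result)
```

Elements that were already stripped out of the tree still appear in the snapshot. Calling strip_tags on them is harmless, because they are detached and no longer part of the document.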