I have a relatively complex XML encoding in which the text can contain an arbitrary number of nested elements. Take this simplified example:
<div>
<p>-I like James <stage><hi>he said to her </hi></stage>, but I am not sure James understands <hi>Peter</hi>'s problems.</p>
</div>
I want to enclose every named entity in the sentence (both instances of James, plus Peter) in an rs element:
<div>
<p>-I like <rs>James</rs> <stage><hi>he said to her </hi></stage>, but I am not sure <rs>James</rs> understands <hi><rs>Peter</rs></hi>'s problems.</p>
</div>
To simplify this, let's say I have a list of names I could find in the text, such as:
names = ["James", "Peter", "Mary"]
I want to use lxml for this. I know I could use etree.SubElement() to append a new element at the end of the p element, but I don't know how to deal with the tails and the other possible elements.
I understand that I need to handle the three references in my example differently.
The first James is in the text of the p element. I could just do this:
p = etree.SubElement(div, "p")
p.text = "-I like <rs>James</rs>"
Right?
The second James is in the tail of the stage element, that is, the text inside p that follows stage. I don't know how to deal with that.
Peter is in the text of the hi element. I guess I have to iterate through all possible elements, look at both the text and the tail of each one, and search for the named entities in my list:
rs = etree.SubElement(hi, "rs")
rs.text = "<rs>Peter</rs>"
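To see where lxml actually stores each name, the example can be parsed and every element's text and tail printed. This is only a minimal sketch, assuming the simplified markup above; the variable names xml and div are illustrative.

```python
from lxml import etree

# The simplified example from above, on one line.
xml = (
    "<div><p>-I like James <stage><hi>he said to her </hi></stage>, "
    "but I am not sure James understands <hi>Peter</hi>'s problems.</p></div>"
)
div = etree.fromstring(xml)

# Walk the tree in document order and show where each string is stored.
for el in div.iter():
    print(f"{el.tag}: text={el.text!r} tail={el.tail!r}")
```

In lxml's model, character data that follows an element's closing tag belongs to that element's tail, so the second James shows up in the tail of stage, while the first James is in the text of p and Peter is in the text of the second hi.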
My guess is that there is a much better way to handle all of this. Any help? Thanks in advance!
It's a little convoluted, but can be done.
Let's say your XML looks like this:
play = '''<?xml version="1.0" encoding="UTF-8"?>
<root>
<div>
<p>
-I like James
<stage>
<hi>he said to her</hi>
</stage>
, but I am not sure James understands
<hi>Peter</hi>
's problems.
</p>
</div>
<div>
<p>
-I like Mary
<stage>
<hi>he said to her</hi>
</stage>
, but I am not sure Peter understands
<hi>James</hi>
's problems.
</p>
</div>
</root>
'''
I inserted another div and added formatting for clarity. Note that this assumes that each <div> contains only one <p>; if that's not the case, it will have to be refined more.
from lxml import etree

doc = etree.XML(play.encode())
names = ["James", "Peter", "Mary"]

# find all the divs that need changing
destinations = doc.xpath('//div')

for destination in destinations:
    # extract the string representation of the current <p> (the "target")
    target = destination.xpath('./p')[0]
    target_str = etree.tostring(target).decode()
    # replace the names with the required tag; note that plain string
    # replacement also matches inside longer words and inside tag names
    for name in names:
        target_str = target_str.replace(name, f'<rs>{name}</rs>')
    # remove the original <p> and replace it with the new one,
    # as an element formed from the new string
    destination.remove(target)
    destination.insert(0, etree.fromstring(target_str))

print(etree.tostring(doc).decode())
In this case, the output should be:
<root>
<div>
<p>
-I like <rs>James</rs>
<stage>
<hi>he said to her</hi>
</stage>
, but I am not sure <rs>James</rs> understands
<hi><rs>Peter</rs></hi>
's problems.
</p></div>
<div>
<p>
-I like <rs>Mary</rs>
<stage>
<hi>he said to her</hi>
</stage>
, but I am not sure <rs>Peter</rs> understands
<hi><rs>James</rs></hi>
's problems.
</p></div>
</root>
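An alternative that stays closer to what the question describes is to work on the tree itself: visit every node, split its text and its tail around the names, and insert the new elements in the right place. The following is only a sketch under some assumptions, not a definitive implementation: wrap_names is a hypothetical helper, names are matched as whole words with a regular expression, and existing rs elements are left alone so the function can be run twice safely.

```python
import re
from lxml import etree


def wrap_names(root, names, tag="rs"):
    """Wrap each whole-word occurrence of a name in a new <tag> element, in place."""
    # Longest names first, so a longer name wins over a shorter one it contains.
    alternatives = "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))
    pattern = re.compile(rf"\b({alternatives})\b")

    def split(string):
        # re.split with one capturing group returns
        # [before, name, between, name, ..., after]; a single item means no match.
        parts = pattern.split(string or "")
        return parts[0], list(zip(parts[1::2], parts[2::2]))

    def build(name, following):
        new = etree.Element(tag)
        new.text = name
        new.tail = following or None
        return new

    # Snapshot the nodes first so the elements created below are not revisited.
    for el in list(root.iter()):
        # Names in el.text become the first children of el. Comments and
        # processing instructions have a non-string tag; their text is skipped.
        if isinstance(el.tag, str) and el.tag != tag:
            head, hits = split(el.text)
            if hits:
                el.text = head or None
                for i, (name, following) in enumerate(hits):
                    el.insert(i, build(name, following))
        # Names in el.tail become siblings placed right after el.
        if el is not root:
            parent = el.getparent()
            head, hits = split(el.tail)
            if hits:
                el.tail = head or None
                start = parent.index(el) + 1
                for i, (name, following) in enumerate(hits):
                    parent.insert(start + i, build(name, following))
    return root


xml = (
    "<div><p>-I like James <stage><hi>he said to her </hi></stage>, "
    "but I am not sure James understands <hi>Peter</hi>'s problems.</p></div>"
)
names = ["James", "Peter", "Mary"]
div = wrap_names(etree.fromstring(xml), names)
print(etree.tostring(div, encoding="unicode"))
```

For the question's one-line example this should produce the target markup from the question, with rs around both instances of James and around Peter inside hi. Because the tree is edited directly, it does not matter how many p elements a div contains, and tag or attribute names are never touched.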