Tags: xml, lxml, named-entity-recognition, tei

Add a new element surrounding a given word in the texts of a given element and its tail using lxml


I have a relatively complex XML encoding in which the text can contain an arbitrary number of nested elements. Let's take this simplified example:

<div>
<p>-I like James <stage><hi>he said to her </hi></stage>, but I am not sure James understands <hi>Peter</hi>'s problems.</p>
</div>

I want to enclose all named entities in the sentence (the two instances of James and Peter) with an rs element:

<div>
<p>-I like <rs>James</rs> <stage><hi>he said to her </hi></stage>, but I am not sure <rs>James</rs> understands <hi><rs>Peter</rs></hi>'s problems.</p>
</div>

To simplify this, let's say I have a list of names I could find in the text, such as:

names = ["James", "Peter", "Mary"]

I want to use lxml for this. I know I could use etree.SubElement() to append a new element at the end of the p element, but I don't know how to deal with the tails and the other possible elements.

I understand that I need to handle the three references in my example differently.

  1. The first James is in the text of the p element. I could just do this:
p = etree.SubElement(div, "p")
p.text = "-I like <rs>James</rs>"

Right?

  2. The second James is in the tail of the stage element. I don't know how to deal with that.
  3. The reference to Peter is in the text of the hi element. I guess I have to iterate through all elements, look at both the text and the tail of each one, and search for the named entities from my list.
rs = etree.SubElement(hi, "rs")
rs.text = "<rs>Peter</rs>"
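
To make the three cases concrete, here is a small sketch (not part of the original attempt) that parses the simplified example and prints where lxml stores each name. It also shows that a string containing markup assigned to .text is escaped rather than parsed, so the p.text approach above would not produce an rs element. The variable names are only for illustration.

```python
from lxml import etree

xml = ("<div><p>-I like James <stage><hi>he said to her </hi></stage>, "
       "but I am not sure James understands <hi>Peter</hi>'s problems.</p></div>")
div = etree.fromstring(xml)
p = div[0]       # the <p> element
stage = p[0]     # first child of <p>
hi = p[1]        # second child of <p>, the <hi> around Peter

print(repr(p.text))      # '-I like James '
print(repr(stage.tail))  # ', but I am not sure James understands '
print(repr(hi.text))     # 'Peter'
print(repr(hi.tail))     # "'s problems."

# Markup assigned as a string is treated as character data and escaped.
demo = etree.Element("p")
demo.text = "-I like <rs>James</rs>"
print(etree.tostring(demo).decode())  # <p>-I like &lt;rs&gt;James&lt;/rs&gt;</p>
```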

My guess is that there is a much better way to handle all of this. Any help? Thanks in advance!


Solution

  • It's a little convoluted, but can be done.

    Let's say your XML looks like this:

    play = '''<?xml version="1.0" encoding="UTF-8"?>
    <root>
       <div>
          <p>
             -I like James
             <stage>
                <hi>he said to her</hi>
             </stage>
             , but I am not sure James understands
             <hi>Peter</hi>
             's problems.
          </p>
       </div>
       <div>
          <p>
             -I like Mary
             <stage>
                <hi>he said to her</hi>
             </stage>
             , but I am not sure Peter understands
             <hi>James</hi>
             's problems.
          </p>
       </div>
    </root>
    '''
    

    I inserted another div and added formatting for clarity. Note that this assumes each <div> contains exactly one <p>; if a <div> can hold several, the loop has to visit every <p> instead of only the first.

    from lxml import etree

    doc = etree.XML(play.encode())
    names = ["James", "Peter", "Mary"]

    # find all the divs that need changing
    destinations = doc.xpath('//div')

    for destination in destinations:
        # extract the string representation of the current <p> (the "target")
        target = destination.xpath('./p')[0]
        target_str = etree.tostring(target).decode()

        # replace the names with the required tag
        # (str.replace leaves the string unchanged when the name is absent)
        for name in names:
            target_str = target_str.replace(name, f'<rs>{name}</rs>')

        # remove the original <p> and replace it with the new one,
        # as an element parsed from the modified string
        destination.remove(target)
        destination.insert(0, etree.fromstring(target_str))

    print(etree.tostring(doc).decode())
    

    In this case, the output should be:

    <root>
       <div>
          <p>
             -I like <rs>James</rs>
             <stage>
                <hi>he said to her</hi>
             </stage>
             , but I am not sure <rs>James</rs> understands
             <hi><rs>Peter</rs></hi>
             's problems.
          </p></div>
       <div>
          <p>
             -I like <rs>Mary</rs>
             <stage>
                <hi>he said to her</hi>
             </stage>
             , but I am not sure <rs>Peter</rs> understands
             <hi><rs>James</rs></hi>
             's problems.
          </p></div>
    </root>
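
The string replacement above is compact, but str.replace also matches inside longer words (for example "Jameson") and inside tag or attribute names, and the loop relies on one <p> per <div>. As an alternative sketch, not the method used above, the tree can be edited directly by splitting each element's text and tail around whole-word matches. The helper names wrap_names and _split are hypothetical. The sketch assumes a document without namespaces (real TEI files usually declare the TEI namespace, so the new tag would need to be qualified) and does not treat names nested inside children of an existing rs element specially.

```python
import re
from lxml import etree

def _split(text, pattern, tag):
    """Split text around matches; return (leading text, list of new elements)."""
    parts = pattern.split(text)  # [before, name, after, name, after, ...]
    new = []
    for name, following in zip(parts[1::2], parts[2::2]):
        el = etree.Element(tag)
        el.text = name
        el.tail = following or None
        new.append(el)
    return parts[0] or None, new

def wrap_names(root, names, tag="rs"):
    """Wrap whole-word occurrences of the given names in <tag> elements."""
    if not names:
        return root
    # Longer names first, so that "Mary Ann" would win over "Mary".
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, ordered)) + r")\b")
    # Snapshot the elements first: the tree is modified inside the loop,
    # and the newly created wrappers must not be visited again.
    for elem in list(root.iter(etree.Element)):
        if elem.text and elem.tag != tag:
            elem.text, new = _split(elem.text, pattern, tag)
            for i, el in enumerate(new):
                elem.insert(i, el)  # new wrappers go before existing children
        parent = elem.getparent()
        if elem.tail and parent is not None:
            elem.tail, new = _split(elem.tail, pattern, tag)
            pos = parent.index(elem) + 1
            for i, el in enumerate(new):
                parent.insert(pos + i, el)  # wrappers become following siblings
    return root

xml = ("<div><p>-I like James <stage><hi>he said to her </hi></stage>, "
       "but I am not sure James understands <hi>Peter</hi>'s problems.</p></div>")
root = wrap_names(etree.fromstring(xml), ["James", "Peter", "Mary"])
result = etree.tostring(root).decode()
print(result)
```

Because the function walks every element, it does not depend on how many <p> elements a <div> contains, and the whitespace between elements is left untouched.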