python xml annotations beautifulsoup

How to extract tag offsets in xml document using Python BeautifulSoup


I need some help finding the text offsets of certain tags in an XML document. My data set follows the format illustrated below: the ROOT element contains several RECORDs, and each RECORD contains exactly one TEXT element. The text may contain several TAG elements that annotate parts of it. Using Python, I need to convert these annotations to another format that requires the begin and end offset of each tag.

<ROOT>
    <RECORD ID="123">
        <TEXT>
        This is an example text written at <TAG TYPE="DATE">December 29th</TAG> to illustrate the problem.
        </TEXT>
    </RECORD>
</ROOT>

Basically, I would like to convert the format above into the following:

<ROOT>
    <RECORD ID="123">
        <TEXT>
        This is an example text written at December 29th to illustrate the problem.
        </TEXT>
        <TAG TYPE="DATE" BEGIN="36" END="49"/>
    </RECORD>
</ROOT>

I've tried using BeautifulSoup but could not find a way of extracting the tag offsets. Ideas anyone?


Solution

  • The idea is to iterate over all TEXT nodes, find the TAG nodes inside each one, locate each TAG's text within the TEXT's text, create a new tag at the RECORD level that carries the offsets, and then unwrap() the TAG so that only its text remains in TEXT:

    from bs4 import BeautifulSoup
    
    data = """
    <ROOT>
        <RECORD ID="123">
            <TEXT>
    This is an example text written at <TAG TYPE="DATE">December 29th</TAG> to illustrate the problem.
            </TEXT>
        </RECORD>
    </ROOT>
    """
    
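    soup = BeautifulSoup(data, "xml")  # the "xml" parser requires lxml
    
    for text in soup.find_all('TEXT'):
        record = text.parent
        for tag in text.find_all('TAG'):
            # offset of the TAG's text within the full TEXT string;
            # index() returns the first match only
            begin = text.text.index(tag.text)
            end = begin + len(tag.text)
    
            # add an empty TAG carrying the offsets at the RECORD level
            # (other attributes such as TYPE are not copied)
            record.append(soup.new_tag(tag.name, BEGIN=begin, END=end))
    
            # replace the inline TAG with its contents
            tag.unwrap()
    
    print(soup)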
    

    Prints:

    <?xml version="1.0" encoding="utf-8"?>
    <ROOT>
    <RECORD ID="123">
    <TEXT>
    This is an example text written at December 29th to illustrate the problem.
            </TEXT>
    <TAG BEGIN="36" END="49"/></RECORD>
    </ROOT>
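
The offsets count from the very start of the TEXT content, including the newline that follows the opening TEXT tag. Here is a small check in plain Python, assuming the TEXT string is exactly what the sample data above produces. The `leading` adjustment is a hypothetical variant for offsets relative to the stripped text, not something the code above does:

```python
# TEXT content as parsed from the sample data: it starts with the newline
# after <TEXT> and ends with the indentation before </TEXT>.
text = "\nThis is an example text written at December 29th to illustrate the problem.\n        "
annotated = "December 29th"

begin = text.index(annotated)
end = begin + len(annotated)
print(begin, end)       # 36 49
print(text[begin:end])  # December 29th

# Hypothetical variant: offsets relative to the text without leading whitespace.
leading = len(text) - len(text.lstrip())
print(begin - leading, end - leading)  # 35 48
```

Slicing the TEXT string with BEGIN and END should always give back the annotated text, which makes a convenient sanity check after conversion.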
    

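    Note: I haven't tested this with multiple TAGs inside one TEXT. Because index() returns the first match, a TAG whose text also occurs earlier in the same TEXT would get the wrong offsets. But it should at least give you a starting point.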
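
For the multiple-TAG case, one option is to walk the children of TEXT in document order and keep a running character count instead of searching for the text. The sketch below swaps BeautifulSoup for the standard library's `xml.etree.ElementTree`, so it is an alternative rather than a fix to the code above. The function name `move_tags_to_record` and the sample string are made up for illustration. It assumes that TAGs are direct children of TEXT and are not nested, and it treats every child element of TEXT as an annotation:

```python
import xml.etree.ElementTree as ET


def move_tags_to_record(xml_string):
    """Return the root element with inline TAGs turned into offset annotations."""
    root = ET.fromstring(xml_string)
    for record in root.iter("RECORD"):
        text = record.find("TEXT")
        if text is None:
            continue
        parts = [text.text or ""]   # plain text collected so far
        offset = len(parts[0])      # running character position
        annotations = []
        for tag in list(text):      # direct children, in document order
            content = "".join(tag.itertext())
            tail = tag.tail or ""   # text between this TAG and the next node
            attrs = dict(tag.attrib)            # keeps TYPE and friends
            attrs["BEGIN"] = str(offset)
            attrs["END"] = str(offset + len(content))
            annotations.append((tag.tag, attrs))
            parts.extend([content, tail])
            offset += len(content) + len(tail)
            text.remove(tag)
        text.text = "".join(parts)  # TEXT now holds plain text only
        for name, attrs in annotations:
            ET.SubElement(record, name, attrs)
    return root


# Hypothetical sample in which the same annotated text appears twice.
sample = """<ROOT>
    <RECORD ID="123">
        <TEXT>
On <TAG TYPE="DATE">May 1st</TAG> and again on <TAG TYPE="DATE">May 1st</TAG>.
        </TEXT>
    </RECORD>
</ROOT>"""

root = move_tags_to_record(sample)
for tag in root.iter("TAG"):
    print(tag.attrib)
# {'TYPE': 'DATE', 'BEGIN': '4', 'END': '11'}
# {'TYPE': 'DATE', 'BEGIN': '25', 'END': '32'}
```

Because each offset comes from the position reached while walking the nodes, two TAGs with identical text get distinct offsets. As with the code above, the offsets include the leading newline of the TEXT content.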