I need some help finding the text offsets of certain tags in an XML document. My data set follows the format illustrated below: the ROOT element contains several RECORDs, and each RECORD contains exactly one TEXT element. The text may contain several TAG elements that annotate parts of it. Using Python, I need to convert these annotations to another format that requires the begin and end offsets of each tag.
<ROOT>
<RECORD ID="123">
<TEXT>
This is an example text written at <TAG TYPE="DATE">December 29th</TAG> to illustrate the problem.
</TEXT>
</RECORD>
</ROOT>
Basically, I would like to convert the above format to the following format:
<ROOT>
<RECORD ID="123">
<TEXT>
This is an example text written at December 29th to illustrate the problem.
</TEXT>
<TAG TYPE="DATE" BEGIN="36" END="49"/>
</RECORD>
</ROOT>
I've tried using BeautifulSoup but could not find a way of extracting the tag offsets. Ideas anyone?
The idea is to iterate over all TEXT nodes, find all TAG nodes inside, get the position of each TAG's text inside the TEXT's text, create a new tag at the RECORD level, and then unwrap() the TAG from TEXT:
from bs4 import BeautifulSoup

data = """
<ROOT>
<RECORD ID="123">
<TEXT>
This is an example text written at <TAG TYPE="DATE">December 29th</TAG> to illustrate the problem.
</TEXT>
</RECORD>
</ROOT>
"""

# The "xml" parser requires lxml to be installed.
soup = BeautifulSoup(data, "xml")

for text in soup.find_all('TEXT'):
    record = text.parent
    for tag in text.find_all('TAG'):
        # index() returns the first occurrence of the tag's text inside TEXT
        begin = text.text.index(tag.text)
        end = len(tag.text) + begin
        # Add an empty TAG with the offsets at the RECORD level
        record.append(soup.new_tag(tag.name, BEGIN=begin, END=end))
        # Replace the original TAG with its contents
        tag.unwrap()

print(soup)
Prints:
<?xml version="1.0" encoding="utf-8"?>
<ROOT>
<RECORD ID="123">
<TEXT>
This is an example text written at December 29th to illustrate the problem.
</TEXT>
<TAG BEGIN="36" END="49"/></RECORD>
</ROOT>
Note: I haven't tested this with multiple TAGs inside a single TEXT, but it should at least give you a starting point.
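For the multiple-TAG case, here is a minimal sketch of one alternative. It swaps BeautifulSoup for the standard library's xml.etree.ElementTree and computes offsets by accumulating text lengths instead of searching with index(), so repeated strings get distinct offsets. The function name convert_record and the sample data are made up for illustration, and it assumes that TAG elements are direct, non-nested children of TEXT. It also copies the original attributes (such as TYPE) onto the new element.

```python
import xml.etree.ElementTree as ET

data = (
    '<ROOT><RECORD ID="1"><TEXT>Met on <TAG TYPE="DATE">Monday</TAG>'
    ' and again on <TAG TYPE="DATE">Monday</TAG>.</TEXT></RECORD></ROOT>'
)


def convert_record(record):
    """Move annotations out of TEXT into sibling elements carrying offsets.

    Hypothetical helper: assumes annotation elements are direct,
    non-nested children of the record's single TEXT element.
    """
    text_el = record.find("TEXT")
    pieces = [text_el.text or ""]   # plain text collected so far
    offset = len(pieces[0])         # running offset into the plain text
    annotations = []

    for child in list(text_el):     # copy, because children are removed below
        inner = "".join(child.itertext())
        tail = child.tail or ""
        # Keep the original attributes and add the computed offsets
        attrs = dict(child.attrib, BEGIN=str(offset), END=str(offset + len(inner)))
        annotations.append((child.tag, attrs))
        pieces.extend([inner, tail])
        offset += len(inner) + len(tail)
        text_el.remove(child)

    text_el.text = "".join(pieces)
    for name, attrs in annotations:
        ET.SubElement(record, name, attrs)


root = ET.fromstring(data)
for record in root.iter("RECORD"):
    convert_record(record)

print(ET.tostring(root, encoding="unicode"))
```

Because the offsets come from a running count rather than a text search, slicing the resulting TEXT content with each BEGIN/END pair should give back exactly the annotated span, even when the same string is annotated twice.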