I'm trying to work with some XML files to do sentence tagging whilst maintaining the original structure of the file. The files look like so:
<text xml:lang="">
<body>
<div>
<p>
<p>
<lb xml:id="p1z1" />19.
<lb xml:id="p1z2" />esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
<lb xml:id="p1z3" />esse epistolam meam interpretatum. Caeterum, quod scribis te ex consilio consanguine
<lb xml:id="p1z4" />et affinium generi tui responsum fratri meo coram dedisse, non
<lb xml:id="p1z5" />possum satis mirari, qui hoc factum sit. Res enim ista ad me suum ad
<lb xml:id="p1z6" />fratrem pertinebat. Nec ita fueram abs te dimissus, quod vel tu tale
<lb xml:id="p1z7" />quid reciperes vel ego probarem, sed ita tua sponte pollicebaris vel te,
<lb xml:id="p1z8" />vel generum mihi per literas responsurum. Frater igitur dixit quidem
<lb xml:id="p1z9" />mihi te in praesentia nescio quorum (qui namque fuerint excidit) voluisse
<lb xml:id="p1z10" />respondere se vero voluisse recipere, imo admonuisse te ut, quemadmodum
<lb xml:id="p1z11" />promisisses, ita faceres. Ego simulatque tergiversationem istam cognoscere
<lb xml:id="p1z12" />non potui aliter interpretari quam ali fortassis aliquid monstri,
<lb xml:id="p1z13" />ut dicitur. Nam quae plana sunt et integra sive dicantur sive scripsisse
<lb xml:id="p1z14" />nihil refert. Utut sit, ego iniuriam illam, ex qua omnes istae
<lb xml:id="p1z15" />difficultates sunt ortae, iampridem domino deque commendavi, qui
<lb xml:id="p1z16" />per Mosen. Mea est ultro et ego retribuam eis in tempore.
<lb xml:id="p1z17" />De altero etiam capite accipio tuam excusationem. Quum enim tam sancte
<lb xml:id="p1z18" />affirmes te semper erga nos non aliter quam bene et fuisse et
...
...
...
</p>
</div>
</body>
</text>
</TEI>
The sentences I need to tag span several lines. Each line starts with a line break tag, <lb xml:id="n" />. I need to tag the sentences and then write them back to the file in their original format. The issue I encounter is that, while the text contains newline characters, as soon as I create a sentence element and try to append it after the line break tag, the newline character isn't valid.
The output should look like:
<text xml:lang="">
<body>
<div>
<p>
<p>
<lb xml:id="p1z1" /><s n="1" xml:lang="la">19.</s>
<lb xml:id="p1z2" /><s n="1" xml:lang="la">esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
<lb xml:id="p1z3" />esse epistolam meam interpretatum.</s><s n="2" xml:lang="la"> Caeterum, quod scribis te ex consilio consanguine
<lb xml:id="p1z4" />et affinium generi tui responsum fratri meo coram dedisse, non
<lb xml:id="p1z5" />possum satis mirari, qui hoc factum sit.</s><s n="3" xml:lang="la"> Res enim ista ad me suum ad
<lb xml:id="p1z6" />fratrem pertinebat.</s><s n="4" xml:lang="la"> Nec ita fueram abs te dimissus, quod vel tu tale
<lb xml:id="p1z7" />quid reciperes vel ego probarem, sed ita tua sponte pollicebaris vel te,
<lb xml:id="p1z8" />vel generum mihi per literas responsurum.</s><s n="5" xml:lang="la"> Frater igitur dixit quidem
<lb xml:id="p1z9" />mihi te in praesentia nescio quorum (qui namque fuerint excidit) voluisse
<lb xml:id="p1z10" />respondere se vero voluisse recipere, imo admonuisse te ut, quemadmodum
<lb xml:id="p1z11" />promisisses, ita faceres.</s><s n="6" xml:lang="la"> Ego simulatque tergiversationem istam cognoscere
<lb xml:id="p1z12" />non potui aliter interpretari quam ali fortassis aliquid monstri,
<lb xml:id="p1z13" />ut dicitur.</s><s n="7" xml:lang="la"> Nam quae plana sunt et integra sive dicantur sive scripsisse
<lb xml:id="p1z14" />nihil refert.</s><s n="8" xml:lang="la"> Utut sit, ego iniuriam illam, ex qua omnes istae
<lb xml:id="p1z15" />difficultates sunt ortae, iampridem domino deque commendavi, qui
<lb xml:id="p1z16" />per Mosen.</s><s n="9" xml:lang="la"> Mea est ultro et ego retribuam eis in tempore.</s>
<lb xml:id="p1z17" /><s n="10" xml:lang="la">De altero etiam capite accipio tuam excusationem.</s><s n="11" xml:lang="la"> Quum enim tam sancte
<lb xml:id="p1z18" />affirmes te semper erga nos non aliter quam bene et fuisse et
...
...
...
</p>
</div>
</body>
</text>
</TEI>
My code looks like:
import xml.etree.ElementTree as ET
from nltk.tokenize import sent_tokenize
import nltk

# Ensure NLTK's sentence tokenizer is available
nltk.download('punkt')

def remove_ns_prefix(tree):
    for elem in tree.iter():
        if '}' in elem.tag:
            elem.tag = elem.tag.split('}', 1)[1]  # Removing namespace
    return tree

def process_file(input_xml, output_xml):
    tree = ET.parse(input_xml)
    root = remove_ns_prefix(tree.getroot())
    for body in root.findall('.//body'):
        for paragraph in body.findall('.//p'):
            # Extract all lb elements and following texts
            lb_elements = list(paragraph.findall('.//lb'))
            lb_ids = [lb.attrib.get('xml:id', '') for lb in lb_elements]  # Store lb ids
            text_after_lb = [(lb.tail if lb.tail else '') for lb in lb_elements]
            # Combine the text and tokenize into sentences
            entire_text = ' '.join(text_after_lb)
            sentences = sent_tokenize(entire_text)
            sentences2 = " ".join(sentences).split("\n")
            print(sentences2)
            # Clear the paragraph's existing content
            paragraph.clear()
            # Pair up lb tags and sentences using zip, reinsert them into the paragraph
            for lb_id, sentence in zip(lb_ids, sentences):
                # Reinsert lb element
                lb_attrib = {'xml:id': lb_id} if lb_id else {}
                new_lb = ET.SubElement(paragraph, 'lb', attrib=lb_attrib)
                # Attach sentence to this lb
                if sentence:
                    sentence_elem = ET.SubElement(paragraph, 's', attrib={'xml:lang': 'la'})
                    sentence_elem.text = sentence
    # Write the modified tree to a new file
    tree.write(output_xml, encoding='utf-8', xml_declaration=True, method='xml')
I'm losing my mind. Hopefully I have an XML pro who is willing to come to my rescue.
I've also tried first tagging, and then reinserting the line break tags afterwards, but due to the nature of XML it's tough. The next thing I would maybe attempt is to create temporary .txt files and go line by line and insert the tags on the lines that don't match...
Any and all help appreciated at this point.
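The job can be done by taking advantage of the tail attribute of the lb elements. Each lb is turned into a list whose first item is the serialized element and whose items with index > 0 are the pieces of element.tail split by the r'(\.|\n)' regexp. The sentence label element is then placed by detecting each sentence's start and its end (a dot). For example: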
['<lb xml:id="p1z1"/>', '19', '.', '', '\n', ' ']
That list represents this element (quoted to show whitespace):
'<lb xml:id="p1z1"/>19.
'
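As a minimal sketch of the splitting step described above: wrapping the pattern in a capture group makes re.split keep the separators, so the pieces can be joined back into the original text without losing dots, newlines or indentation. The tail value below is a hypothetical stand-in for the text after the first lb.

```python
import re

# Hypothetical tail of the first <lb/>: "19." followed by a newline and indentation.
tail = "19.\n    "

# The capture group keeps the separators ('.' and '\n') in the result.
parts = re.split(r"(\.|\n)", tail)
print(parts)

# Joining the pieces restores the tail exactly, whitespace included.
print("".join(parts) == tail)
```

Because nothing is dropped by the split, tags can be spliced in between pieces and the original line layout survives.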
The script does not take namespaces into account and is provided as a proof of concept of the parsing technique. It could be cleaner to label sentences with a self-closing element:
<lb xml:id="p1z2"/><s n="2"/>esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
<lb xml:id="p1z3"/>esse epistolam meam interpretatum.<s n="3"/> Caeterum, quod scribis te ex consilio consanguine
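On the namespace caveat: the following is a sketch, not part of the script, of how the lb elements and their ids could be looked up when the file declares a default namespace. It uses the standard library's ElementTree, as in the question. The TEI namespace URI and the sample string are assumptions, since the root element is not shown in the question. The xml: prefix is always bound to a fixed namespace, which is why attrib.get('xml:id') finds nothing after parsing.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"  # fixed namespace of the xml: prefix
TEI_NS = "http://www.tei-c.org/ns/1.0"           # assumption: the files declare the TEI namespace

# Hypothetical, shortened sample with a default namespace on the root.
sample = (
    f'<TEI xmlns="{TEI_NS}"><text><body><div><p>'
    '<lb xml:id="p1z1"/>19.\n<lb xml:id="p1z2"/>esse epistolam'
    '</p></div></body></text></TEI>'
)
root = ET.fromstring(sample)

def lb_ids(root):
    """Return the xml:id of every namespaced <lb/> element, in document order."""
    return [lb.get(f"{{{XML_NS}}}id") for lb in root.iter(f"{{{TEI_NS}}}lb")]

print(lb_ids(root))

# The prefixed name is not a valid lookup key after parsing.
first_lb = next(root.iter(f"{{{TEI_NS}}}lb"))
print(first_lb.get("xml:id"))
```

Looking elements up by their qualified names avoids having to strip namespaces from the tree before processing.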
Given this sample
<text xml:lang="">
<body>
<div>
<p>
<p>
<lb xml:id="p1z1"/>19.
<lb xml:id="p1z2"/>esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
<lb xml:id="p1z3"/>esse epistolam meam interpretatum. Caeterum, quod scribis te ex consilio consanguine
<lb xml:id="p1z4"/>et affinium generi tui responsum fratri meo coram dedisse, non
<lb xml:id="p1z5"/>possum satis mirari, qui hoc factum sit. Res enim ista ad me suum ad
<lb xml:id="p1z6"/>fratrem pertinebat. Nec ita fueram abs te dimissus, quod vel tu tale
<lb xml:id="p1z7"/>quid reciperes vel ego probarem, sed ita tua sponte pollicebaris vel te,
<lb xml:id="p1z8"/>vel generum mihi per literas responsurum. Frater igitur dixit quidem
<lb xml:id="p1z9"/>mihi te in praesentia nescio quorum (qui namque fuerint excidit) voluisse
<lb xml:id="p1z10"/>respondere se vero voluisse recipere, imo admonuisse te ut, quemadmodum
<lb xml:id="p1z11"/>promisisses, ita faceres. Ego simulatque tergiversationem istam cognoscere
<lb xml:id="p1z12"/>non potui aliter interpretari quam ali fortassis aliquid monstri,
<lb xml:id="p1z13"/>ut dicitur. Nam quae plana sunt et integra sive dicantur sive scripsisse
<lb xml:id="p1z14"/>nihil refert. Utut sit, ego iniuriam illam, ex qua omnes istae
<lb xml:id="p1z15"/>difficultates sunt ortae, iampridem domino deque commendavi, qui
<lb xml:id="p1z16"/>per Mosen. Mea est ultro et ego retribuam eis in tempore.
<lb xml:id="p1z17"/>De altero etiam capite accipio tuam excusationem. Quum enim tam sancte
<lb xml:id="p1z18"/>affirmes te semper erga nos non aliter quam bene et fuisse et
</p>
</p>
</div>
</body>
</text>
Result
<text xml:lang="">
<body>
<div>
<p>
<p>
<lb xml:id="p1z1"/><s n="1"/>19.
<lb xml:id="p1z2"/><s n="2"/>esse Christolam meam te adeo candide et humaniter Bullingere colendissime,
<lb xml:id="p1z3"/>esse epistolam meam interpretatum.<s n="3"/> Caeterum, quod scribis te ex consilio consanguine
<lb xml:id="p1z4"/>et affinium generi tui responsum fratri meo coram dedisse, non
<lb xml:id="p1z5"/>possum satis mirari, qui hoc factum sit.<s n="4"/> Res enim ista ad me suum ad
<lb xml:id="p1z6"/>fratrem pertinebat.<s n="5"/> Nec ita fueram abs te dimissus, quod vel tu tale
<lb xml:id="p1z7"/>quid reciperes vel ego probarem, sed ita tua sponte pollicebaris vel te,
<lb xml:id="p1z8"/>vel generum mihi per literas responsurum.<s n="6"/> Frater igitur dixit quidem
<lb xml:id="p1z9"/>mihi te in praesentia nescio quorum (qui namque fuerint excidit) voluisse
<lb xml:id="p1z10"/>respondere se vero voluisse recipere, imo admonuisse te ut, quemadmodum
<lb xml:id="p1z11"/>promisisses, ita faceres.<s n="7"/> Ego simulatque tergiversationem istam cognoscere
<lb xml:id="p1z12"/>non potui aliter interpretari quam ali fortassis aliquid monstri,
<lb xml:id="p1z13"/>ut dicitur.<s n="8"/> Nam quae plana sunt et integra sive dicantur sive scripsisse
<lb xml:id="p1z14"/>nihil refert.<s n="9"/> Utut sit, ego iniuriam illam, ex qua omnes istae
<lb xml:id="p1z15"/>difficultates sunt ortae, iampridem domino deque commendavi, qui
<lb xml:id="p1z16"/>per Mosen.<s n="10"/> Mea est ultro et ego retribuam eis in tempore.
<lb xml:id="p1z17"/><s n="11"/>De altero etiam capite accipio tuam excusationem.<s n="12"/> Quum enim tam sancte
<lb xml:id="p1z18"/>affirmes te semper erga nos non aliter quam bene et fuisse et
</p>
</p>
</div>
</body>
</text>
Set self_close = False to get the OP's labels. The script below also restores the re-parsed element back into the doc:
import re
from lxml import etree

doc = etree.parse('/home/luis/tmp/tmp.xml')
# find parent element
parent = doc.xpath('//div/p/p')[0]
# keep indentation of first lb
xml_out = '<p>' + (parent.text or '')
i = 1              # sentence counter
is_open = False    # True while inside a sentence
self_close = True  # False wraps sentences in <s>...</s> instead of <s/> markers
for t in parent.xpath('lb'):
    # parts[0] is the serialized lb, the rest is its tail split on dots and newlines
    parts = ['']
    parts.extend(re.split(r'(\.|\n)', t.tail or ''))
    t.tail = None
    parts[0] = etree.tostring(t).decode('utf-8')
    #print(parts)
    for p, e in enumerate(parts):
        # empty or whitespace-only pieces never start a sentence
        skip = (e == '' or re.match(r'^(\n|\s+)$', e) is not None)
        if p > 0 and not is_open and not skip:
            # first text piece after a sentence end: sentence start
            if self_close:
                parts[p] = f'<s n="{i}"/>{e}'
            else:
                parts[p] = f'<s n="{i}">{e}'
            is_open = True
        elif is_open and e == '.':
            # a dot ends the current sentence
            if not self_close:
                parts[p] = '.</s>'
            is_open = False
            i += 1
    # append this line once all its pieces are processed
    # note: tail text is inserted unescaped, so & or < in the text would need escaping
    xml_out += ''.join(parts)
# close the last sentence if it does not end with a dot
if is_open and not self_close:
    xml_out += '</s>'
xml_out += '</p>'
# parse back to an element
xfrag = etree.fromstring(xml_out)
xfrag.tail = parent.tail
# replace parent element on document
parent.getparent().replace(parent, xfrag)
print(etree.tostring(doc).decode('utf-8'))
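Since the goal is to keep the original structure, a quick sanity check after tagging can help. This is only a sketch using the standard library. The helper name same_text_and_breaks and the two short fragments are made up for illustration. It compares the concatenated text content and the sequence of lb ids before and after tagging.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # qualified name of xml:id

def same_text_and_breaks(before_xml, after_xml):
    """True if both fragments have identical text content and identical <lb/> ids."""
    before, after = ET.fromstring(before_xml), ET.fromstring(after_xml)
    same_text = "".join(before.itertext()) == "".join(after.itertext())
    ids_before = [lb.get(XML_ID) for lb in before.iter("lb")]
    ids_after = [lb.get(XML_ID) for lb in after.iter("lb")]
    return same_text and ids_before == ids_after

# Hypothetical two-line fragment before and after tagging.
before = '<p>\n<lb xml:id="p1z1"/>19.\n<lb xml:id="p1z2"/>esse epistolam meam.\n</p>'
after = ('<p>\n<lb xml:id="p1z1"/><s n="1">19.</s>\n'
         '<lb xml:id="p1z2"/><s n="2">esse epistolam meam.</s>\n</p>')
print(same_text_and_breaks(before, after))

# A result that lost a line break tag fails the check.
broken = '<p>\n<s n="1">19.</s>\n<lb xml:id="p1z2"/><s n="2">esse epistolam meam.</s>\n</p>'
print(same_text_and_breaks(before, broken))
```

The check holds for both labelling styles, because neither the self-closing markers nor the wrapping elements add or remove text.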