I have a TEI-encoded XML file with elements as follows:
<sp>
<speaker rend="italic">Sampson.</speaker>
<ab>
<lb n="5"/>
<hi rend="italic">Gregory:</hi>
<seg type="homograph">A</seg> my word wee'l not carry coales.<lb n="6"/>
</ab>
</sp>
<sp>
<speaker rend="italic">Greg.</speaker>
<ab>No, for then we should be Colliars.
<lb n="7" rend="rj"/>
</ab>
</sp>
The full file is very large but can be accessed here: http://ota.ox.ac.uk/desc/5721. I'm attempting to use Python 3 to traverse the XML and get all the text associated with the <ab> tag, which is where the dialogue is found.
import xml.etree.ElementTree as etree
tree = etree.parse('romeo_juliet_5721.xml')
doc = tree.getroot()
for i in doc.iter(tag='{http://www.tei-c.org/ns/1.0}ab'):
    print(i.tag, i.text)
>>> {http://www.tei-c.org/ns/1.0}ab
>>>
>>> {http://www.tei-c.org/ns/1.0}ab No, for then we should be Colliars.
The output finds the <ab> elements just fine but doesn't recognize "my word wee'l not carry coales" as the text of the first <ab>. If it's within a different element, I'm not seeing it. I've thought about converting the entire element to a string and extracting the text with a regex (or by stripping all XML tags), but I would rather understand what's happening here. Thanks for any help you can provide.
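That's because in the ElementTree model, the text " my word wee'l not carry coales." is considered the tail of the <seg> element rather than the text of <ab>. An element's text is only what comes before its first child; anything that follows a child's closing tag is stored as that child's tail. To get the text of an element together with the tails of its children, you can try this: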
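for i in doc.iter(tag='{http://www.tei-c.org/ns/1.0}ab'):
    # i.text can be None; skip i itself, since its own tail lies outside <ab>
    tails = ''.join((el.tail or '') for el in i.iter() if el is not i)
    innerText = ((i.text or '') + tails).strip()
    print(i.tag, innerText)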
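To see the text/tail split directly, here is a minimal sketch run against a trimmed copy of the first <sp> from the question. The <body> wrapper element and the SAMPLE and NS names are assumptions added so the snippet parses on its own; they are not taken from the actual file. It also shows itertext() as one possible alternative. Unlike the tail-only loop above, itertext() also picks up the text inside child elements such as <hi> and <seg>, so "Gregory:" and "A" appear in the result.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the first <sp> from the question. The <body> wrapper is a
# hypothetical stand-in that declares the TEI namespace for this snippet.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
<sp>
<speaker rend="italic">Sampson.</speaker>
<ab>
<lb n="5"/>
<hi rend="italic">Gregory:</hi>
<seg type="homograph">A</seg> my word wee'l not carry coales.<lb n="6"/>
</ab>
</sp>
</body>"""

NS = '{http://www.tei-c.org/ns/1.0}'
root = ET.fromstring(SAMPLE)
ab = root.find(f'.//{NS}ab')
seg = ab.find(f'{NS}seg')

# ab.text is only the whitespace between <ab> and its first child.
print(repr(ab.text))
# The dialogue follows </seg>, so it is stored as seg.tail.
print(repr(seg.tail))

# itertext() yields every text and tail inside <ab> in document order;
# split/join collapses the line breaks left over from the markup.
full_text = ' '.join(''.join(ab.itertext()).split())
print(full_text)
```

Whether you want the speaker cue inside <hi> included depends on your use case; if not, the tail-only approach above (or filtering by tag) may suit better.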