Traversing TEI in Python 3: text comes up empty for some elements


I have a TEI-encoded XML file with elements like the following:

<sp>
    <speaker rend="italic">Sampson.</speaker>
    <ab>
         <lb n="5"/>
         <hi rend="italic">Gregory:</hi>
         <seg type="homograph">A</seg> my word wee'l not carry coales.<lb n="6"/>
    </ab>
</sp>
<sp>
     <speaker rend="italic">Greg.</speaker>
     <ab>No, for then we should be Colliars.
         <lb n="7" rend="rj"/>
     </ab>
</sp>

The full file is very large but can be accessed here: http://ota.ox.ac.uk/desc/5721. I'm attempting to use Python 3 to traverse the XML and get all the text associated with the <ab> tag, which is where the dialogue is found.

import xml.etree.ElementTree as etree
tree = etree.parse('romeo_juliet_5721.xml')
doc = tree.getroot()
for i in doc.iter(tag='{http://www.tei-c.org/ns/1.0}ab'):
    print(i.tag, i.text)
>>> {http://www.tei-c.org/ns/1.0}ab
>>>
>>> {http://www.tei-c.org/ns/1.0}ab No, for then we should be Colliars.

The output finds the <ab> elements just fine but doesn't show "my word wee'l not carry coales" as the text of the first one. If it's within a different element, I'm not seeing it. I've thought about converting the entire element to a string and extracting the text with a regex (or by stripping all XML tags), but I would rather understand what's happening here. Thanks for any help you can provide.


Solution

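  • That's because, in the ElementTree model, the text " my word wee'l not carry coales." is considered the tail of the <seg> element rather than the text of <ab>. An element's text is only what appears before its first child; anything that follows a child's closing tag is stored in that child's tail. To get the text of an element together with the tails of its children, you can try this: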

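    for i in doc.iter(tag='{http://www.tei-c.org/ns/1.0}ab'):
        # i.text can be None, so fall back to ''. Iterating over i itself
        # (rather than i.iter()) visits only direct children and skips i's own tail.
        innerText = ((i.text or '') + ''.join((child.tail or '') for child in i)).strip()
        print(i.tag, innerText)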
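
To make the text/tail split concrete, here is a minimal, self-contained sketch. It is not run against the full file: `SAMPLE` is a hypothetical inline document built from the snippet in the question, wrapped in an assumed root element that declares the TEI namespace. It also shows `itertext()`, a standard ElementTree method, as an alternative when the text inside child elements (here "Gregory:" and "A") is wanted too.

```python
import xml.etree.ElementTree as etree

TEI = '{http://www.tei-c.org/ns/1.0}'

# Hypothetical sample: the first <sp> from the question, wrapped in an
# assumed root element that declares the TEI namespace.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
<sp>
    <speaker rend="italic">Sampson.</speaker>
    <ab>
         <lb n="5"/>
         <hi rend="italic">Gregory:</hi>
         <seg type="homograph">A</seg> my word wee'l not carry coales.<lb n="6"/>
    </ab>
</sp>
</TEI>"""

doc = etree.fromstring(SAMPLE)
ab = doc.find('.//' + TEI + 'ab')
seg = ab.find(TEI + 'seg')

# The words that follow </seg> are stored on the <seg> element's tail,
# not on <ab>.text (which here holds only leading whitespace).
seg_tail = seg.tail

# itertext() yields the text and tails of all descendants in document
# order; split/join collapses the indentation whitespace.
full_text = ' '.join(''.join(ab.itertext()).split())

# Only the text that sits directly inside <ab>: its own text plus the
# tails of its direct children.
direct_text = ' '.join(
    ((ab.text or '') + ''.join(child.tail or '' for child in ab)).split()
)

print(repr(seg_tail))
print(full_text)
print(direct_text)
```

On this sample, `full_text` comes out as "Gregory: A my word wee'l not carry coales." while `direct_text` is only "my word wee'l not carry coales.", so which one to use depends on whether the contents of <hi> and <seg> should count as part of the dialogue.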