I am trying to use Python to parse an XML file. I would like to extract the text that occurs between specified XML tags.
The code I am running is:
import xml.etree.ElementTree as ET
tree = ET.parse('020012_doctored.xml')
root = tree.getroot()
for w in root.iter('w'):
    print(w.text)
The XML file is as follows. It's a complex file with quite a loose structure, combining elements of sequence and hierarchy (I have simplified it for the purposes of this question), but there clearly is a "w" tag, which I would expect the code to pick up.
Thanks.
<?xml version="1.0" encoding="UTF-8"?>
<CHAT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.talkbank.org/ns/talkbank"
xsi:schemaLocation="http://www.talkbank.org/ns/talkbank https://talkbank.org/software/talkbank.xsd"
Media="020012" Mediatypes="audio"
DesignType="long"
ActivityType="toyplay"
GroupType="TD"
PID="11312/c-00018213-1"
Version="2.20.0"
Lang="eng"
Options="bullets"
Corpus="xxxx"
Date="xxxx-xx-xx"
>
<Participants>
<participant
id="MOT"
name="Mother"
role="Mother"
language="eng"
sex="female"
/>
</Participants>
<comment type="Date">15-APR-1999</comment>
<u who="INV" uID="u0">
<w untranscribed="untranscribed">www</w>
<t type="p"></t>
<media
start="7.639"
end="9.648"
unit="s"
/>
<a type="addressee">MOT</a>
</u>
<u who="MOT" uID="u1">
<w untranscribed="untranscribed">www</w>
<t type="p"></t>
<media
start="7.640"
end="9.455"
unit="s"
/>
<a type="addressee">INV</a>
</u>
<u who="CHI" uID="u2">
<w untranscribed="unintelligible">xxx</w>
<w formType="family-specific">choo_choos<mor type="mor"><mw><pos><c>fam</c></pos><stem>choo_choos</stem></mw><gra type="gra" index="1" head="0" relation="INCROOT"/></mor></w>
<t type="p"><mor type="mor"><mt type="p"/><gra type="gra" index="2" head="1" relation="PUNCT"/></mor></t>
<postcode>I</postcode>
<media
start="10.987"
end="12.973"
unit="s"
/>
<a type="comments">looking at pictures of trains</a>
</u>
</CHAT>
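The root element declares a default namespace (xmlns="http://www.talkbank.org/ns/talkbank"), so every element's tag is really {http://www.talkbank.org/ns/talkbank}w rather than plain w, and root.iter('w') matches nothing. You can also define the namespace once for reuse and pass it to iterfind: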
NS = { 'ww' : 'http://www.talkbank.org/ns/talkbank' }
for w in root.iterfind('.//ww:w',NS):
    print(w.text)
The result would be:
www
www
xxx
choo_choos
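To see the effect of the default namespace in isolation, here is a minimal sketch that parses a trimmed-down stand-in for the file from a string. The SAMPLE data and the names TALKBANK_NS, plain, qualified and prefixed are made up for illustration; only the namespace URI comes from the question.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns attribute on the root element.
TALKBANK_NS = "http://www.talkbank.org/ns/talkbank"

# Hypothetical, heavily shortened version of the question's file.
SAMPLE = (
    '<CHAT xmlns="http://www.talkbank.org/ns/talkbank">'
    '<u who="INV" uID="u0"><w>www</w></u>'
    '<u who="CHI" uID="u2"><w>xxx</w>'
    '<w>choo_choos<mor type="mor"/></w></u>'
    '</CHAT>'
)

root = ET.fromstring(SAMPLE)

# The parser stores tags in "{namespace}localname" form.
print(root.tag)

# An unqualified tag name does not match namespaced elements.
plain = [w.text for w in root.iter("w")]

# Option 1: spell out the fully qualified tag name.
qualified = [w.text for w in root.iter("{" + TALKBANK_NS + "}w")]

# Option 2: map a prefix of your choice to the namespace.
prefixed = [w.text for w in root.iterfind(".//ww:w", {"ww": TALKBANK_NS})]

print(plain)
print(qualified)
print(prefixed)
```

Printing root.tag is a quick way to check whether a file uses a default namespace before deciding how to write the search paths. The prefix ("ww" here) is arbitrary and only has to match the key in the dictionary passed to iterfind.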