Parsing an XML file with Python using root.iter does not list text


I am trying to use Python to parse an XML file. I would like to extract the text that occurs between specified XML tags.

The code I am running is


import xml.etree.ElementTree as ET
tree = ET.parse('020012_doctored.xml')
root = tree.getroot()
for w in root.iter('w'):
    print(w.text)

The XML file is shown below. It is a complex file with a fairly loose structure that combines sequence and hierarchy (I have simplified it for this question), but it clearly contains "w" tags, which the code should be picking up. Instead, no text is listed.

Thanks.

<?xml version="1.0" encoding="UTF-8"?>

<CHAT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns="http://www.talkbank.org/ns/talkbank"
      xsi:schemaLocation="http://www.talkbank.org/ns/talkbank https://talkbank.org/software/talkbank.xsd"
      Media="020012" Mediatypes="audio"
            DesignType="long"
            ActivityType="toyplay"
            GroupType="TD"
      PID="11312/c-00018213-1"
      Version="2.20.0"
      Lang="eng"
      Options="bullets"
      Corpus="xxxx"
      Date="xxxx-xx-xx"
      >
  <Participants>
    <participant
      id="MOT"
    name="Mother"
      role="Mother"
      language="eng"
      sex="female"
    />
  </Participants>
  <comment type="Date">15-APR-1999</comment>
  <u who="INV" uID="u0">
    <w untranscribed="untranscribed">www</w>
    <t type="p"></t>
    <media
      start="7.639"
      end="9.648"
      unit="s"
    />
    <a type="addressee">MOT</a>
  </u>
  <u who="MOT" uID="u1">
    <w untranscribed="untranscribed">www</w>
    <t type="p"></t>
    <media
      start="7.640"
      end="9.455"
      unit="s"
    />
    <a type="addressee">INV</a>
  </u>
  <u who="CHI" uID="u2">
    <w untranscribed="unintelligible">xxx</w>
    <w formType="family-specific">choo_choos<mor type="mor"><mw><pos><c>fam</c></pos><stem>choo_choos</stem></mw><gra type="gra" index="1" head="0" relation="INCROOT"/></mor></w>
    <t type="p"><mor type="mor"><mt type="p"/><gra type="gra" index="2" head="1" relation="PUNCT"/></mor></t>
    <postcode>I</postcode>
    <media
      start="10.987"
      end="12.973"
      unit="s"
    />
    <a type="comments">looking at pictures of trains</a>
  </u>

  </CHAT>


Solution

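  • The root element declares a default namespace (xmlns="http://www.talkbank.org/ns/talkbank"), so every element in the file, including w, belongs to that namespace and root.iter('w') matches nothing. You can define a prefix for the namespace and pass the mapping to iterfind: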

    NS = { 'ww' : 'http://www.talkbank.org/ns/talkbank' }
    for w in root.iterfind('.//ww:w',NS):
        print(w.text)
    

    For the file in the question, this prints:

    www
    www
    xxx
    choo_choos
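
To see the mismatch directly, it can help to print a tag as ElementTree stores it and compare an unqualified lookup with a qualified one. The following is a minimal sketch, assuming a trimmed-down copy of the file embedded as a string; the names XML and TALKBANK are illustrative and not part of the original code.

```python
import xml.etree.ElementTree as ET

# Reduced sample (assumption: a shortened stand-in for the real file).
XML = """<CHAT xmlns="http://www.talkbank.org/ns/talkbank">
  <u who="INV" uID="u0">
    <w untranscribed="untranscribed">www</w>
  </u>
  <u who="CHI" uID="u2">
    <w untranscribed="unintelligible">xxx</w>
    <w formType="family-specific">choo_choos<mor type="mor"/></w>
  </u>
</CHAT>"""

TALKBANK = "http://www.talkbank.org/ns/talkbank"
root = ET.fromstring(XML)

# ElementTree stores namespaced tags in the form "{uri}localname".
print(root.tag)

# The bare name does not match any element in the default namespace.
unqualified = [w.text for w in root.iter("w")]

# The fully qualified name does match.
qualified = [w.text for w in root.iter(f"{{{TALKBANK}}}w")]

print(unqualified)
print(qualified)
```

Passing the fully qualified name to iter is equivalent to the prefix mapping used with iterfind; the mapping is usually easier to read when several different tags are looked up.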