python xml data-processing txt language-model

How to get text from XML via Python?

I'm training language model. My input are subtitles in XML format. I need to get just plain text from it and save to to a text file so I can work with it.

Input

<?xml version="1.0" encoding="utf-8"?> <document> <s id="1"> <time id="T1S" value="00:00:14,660" /> <w id="1.1">-</w> <w id="1.2">Všetko</w> <w id="1.3">v</w> <w id="1.4">poriadku</w> <w id="1.5">.</w> </s></document>

Output

- Všetko v poriadku .

Solution

In XML terms, you want the string-value of the XML element of interest.

Here is how to get the string-value of the root element in Python using XPath:

import lxml.etree as ET


xmlstr = """
  <r status="ready">
    <line>First line.</line>
    <line>Second line, with <i>italic text</i>.</line>
  </r>
"""

root = ET.fromstring(xmlstr)
svalue = root.xpath('string(/)')
print(svalue)

The above code prints only the text,

    First line.
    Second line, with italic text.

as requested