I'm training language model. My input are subtitles in XML format. I need to get just plain text from it and save to to a text file so I can work with it.
<?xml version="1.0" encoding="utf-8"?> <document> <s id="1"> <time id="T1S" value="00:00:14,660" /> <w id="1.1">-</w> <w id="1.2">Všetko</w> <w id="1.3">v</w> <w id="1.4">poriadku</w> <w id="1.5">.</w> </s></document>
- Všetko v poriadku .
In XML terms, you want the string-value of the XML element of interest.
Here is how to get the string-value of the root element in Python using XPath:
import lxml.etree as ET
xmlstr = """
<r status="ready">
<line>First line.</line>
<line>Second line, with <i>italic text</i>.</line>
</r>
"""
root = ET.fromstring(xmlstr)
svalue = root.xpath('string(/)')
print(svalue)
The above code prints only the text,
First line.
Second line, with italic text.
as requested