Tags: python, xml, data-processing, txt, language-model

How to get text from XML via Python?


I'm training a language model. My input is subtitles in XML format. I need to extract just the plain text and save it to a text file so I can work with it.

Input

<?xml version="1.0" encoding="utf-8"?>
<document>
  <s id="1">
    <time id="T1S" value="00:00:14,660" />
    <w id="1.1">-</w>
    <w id="1.2">Všetko</w>
    <w id="1.3">v</w>
    <w id="1.4">poriadku</w>
    <w id="1.5">.</w>
  </s>
</document>

Output

- Všetko v poriadku . 

Solution

  • In XML terms, you want the string-value of the XML element of interest.

    Here is how to get the string-value of the root element in Python using XPath:

    import lxml.etree as ET

    xmlstr = """
      <r status="ready">
        <line>First line.</line>
        <line>Second line, with <i>italic text</i>.</line>
      </r>
    """

    root = ET.fromstring(xmlstr)
    # string(/) concatenates every text node in the document, dropping all markup
    svalue = root.xpath('string(/)')
    print(svalue)
    

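    The code above prints only the text, with all markup removed, as requested:

        First line.
        Second line, with italic text.

    Whitespace between elements is kept as-is, so strip or normalize it if you need clean lines.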
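For the subtitle format in the question, the string-value alone would run the words together with whatever whitespace sits between the `<w>` elements. Here is a minimal sketch that instead joins the `<w>` tokens of each `<s>` element with single spaces. It assumes every sentence is an `<s>` element whose words are `<w>` children, as in the sample input, and it uses the standard-library `xml.etree.ElementTree` rather than lxml so it runs without extra packages. The helper name `sentences` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Sample input from the question, passed as bytes so the
# encoding declaration in the XML prolog is honored by the parser.
xml_bytes = """<?xml version="1.0" encoding="utf-8"?>
<document>
  <s id="1">
    <time id="T1S" value="00:00:14,660" />
    <w id="1.1">-</w>
    <w id="1.2">Všetko</w>
    <w id="1.3">v</w>
    <w id="1.4">poriadku</w>
    <w id="1.5">.</w>
  </s>
</document>""".encode("utf-8")


def sentences(xml_data):
    """Yield one plain-text line per <s> element (hypothetical helper)."""
    root = ET.fromstring(xml_data)
    for s in root.iter("s"):
        # Keep only <w> tokens that have text; <time> elements are skipped.
        words = [w.text for w in s.iter("w") if w.text]
        yield " ".join(words)


lines = list(sentences(xml_bytes))
print("\n".join(lines))
```

To save the result, write `"\n".join(lines)` to a file opened with `encoding="utf-8"` so that characters such as `š` survive the round trip.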