Tags: python, xml, lxml

lxml .text returns None when string contains tags


I am traversing a complex XML file with millions of TU nodes and extracting strings from <seg> elements. Whenever a <seg> element contains serialized tags, I get None instead of a string.

Code that returns None:

source_segment = ET.parse(file).getroot().find('body').findall('tu')[0].findall('tuv')[0].find('seg').text

Sample content of <seg> element that causes the issue:

<seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>

Expected value of string variable source_segment:

<bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" />

I can't serialize ET.parse(file).getroot().find('body').findall('tu')[0].findall('tuv')[0].find('seg').text because it is None. If I serialize only the element itself, ET.parse(file).getroot().find('body').findall('tu')[0].findall('tuv')[0].find('seg'), I get this:

b'<seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>\n      '

Sample XML content:

<?xml version="1.0" encoding="utf-8"?>
<tmx version="1.4">
  <header creationtool="XXXXXXXX" creationtoolversion="100" o-tmf="XXXXXXXX" datatype="xml" segtype="sentence" adminlang="en-GB" srclang="en-GB" creationdate="XXXXXXXX" creationid="XXXXXXXX">
    <prop type="x-Note:SingleString"></prop>
    <prop type="x-Recognizers">RecognizeAll</prop>
    <prop type="x-IncludesContextContent">True</prop>
    <prop type="x-TMName">XXXXXXXX</prop>
    <prop type="x-TokenizerFlags">DefaultFlags</prop>
    <prop type="x-WordCountFlags">DefaultFlags</prop>
  </header>
  <body>
    <tu creationdate="XXXXXXXX" creationid="XXXXXXXX" changedate="XXXXXXXX" changeid="XXXXXXXX" lastusagedate="XXXXXXXX" usagecount="1">
      <prop type="x-LastUsedBy">XXXXXXXX</prop>
      <prop type="x-Context">0, 0</prop>
      <prop type="x-Origin">TM</prop>
      <prop type="x-ConfirmationLevel">Translated</prop>
      <prop type="x-StructureContext:MultipleString">sdl:cdata</prop>
      <prop type="x-Note:SingleString">XXXXXXXX</prop>
      <tuv xml:lang="en-GB">
        <seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>
      </tuv>
      <tuv xml:lang="lt-LT">
        <seg><bpt i="1" type="14" x="1" />YYYYYYYYYYYYY<ept i="1" /><ph x="4" type="33" /></seg>
      </tuv>
    </tu>
  </body>
</tmx>

How do I extract the string from a <seg> element when it contains serialized tags?


Solution

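  • .text is None here because it holds only the text that precedes an element's first child. This <seg> starts directly with <bpt />, and "Coded glass plate" is stored as the tail of <bpt>, not as the text of <seg>. The best approach I found is to serialize the <seg> element itself to a string, passing encoding=str so that lxml returns a str directly (no bytes object to decode) and non-ASCII characters are preserved. Then use a regex to strip the enclosing <seg> tags from the resulting string.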

    import re
    from lxml import etree as ET
    
    root = ET.parse('seg.xml').getroot()
    
    seg_elem = root.find('body').findall('tu')[0].findall('tuv')[0].find('seg')
    
    seg_string = ET.tostring(seg_elem, encoding=str)
    
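    # Match everything between the opening <seg> and the closing </seg> tag
    # (re.DOTALL lets '.' span line breaks inside a segment)
    seg_pattern = re.compile(r'(?<=<seg>).*?(?=</seg>)', re.DOTALL)
    # Extract the inner markup without the enclosing <seg> tags
    final_string = seg_pattern.search(seg_string).group()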
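
A regex-free variant may also be worth considering. The sketch below rebuilds the inner markup from the element's own text plus each serialized child, relying on the fact that tostring() appends a child's tail text by default. The helper name inner_xml and the SAMPLE string are illustrative and not part of the original answer, and the fallback import is only an assumption that makes the snippet runnable where lxml is not installed.

```python
try:
    from lxml import etree as ET
except ImportError:
    # Assumption: fall back to the standard library, whose tostring()
    # also appends an element's tail text.
    import xml.etree.ElementTree as ET

# Hypothetical sample, copied from the <seg> element in the question.
SAMPLE = (
    '<seg><bpt i="1" type="14" x="1" />Coded glass plate'
    '<ept i="1" /><ph x="4" type="33" /></seg>'
)


def inner_xml(elem):
    """Return the markup between elem's start and end tags as a str."""
    # Text before the first child; None when the element starts with a child.
    parts = [elem.text or ""]
    for child in elem:
        # Serializes the child element followed by its tail text.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


seg = ET.fromstring(SAMPLE)
print(seg.text)      # None: <seg> starts directly with <bpt />
print(seg[0].tail)   # Coded glass plate
print(inner_xml(seg))
```

Because the result goes through the serializer, small details such as the space before "/>" may not match the source file byte for byte. If an exact copy of the original markup is required, this approach is probably not suitable.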