I am traversing a complex XML file with millions of TU nodes and extracting strings from <seg> elements. Whenever a <seg> element contains serialized tags, I get a None object instead of a string.
Code that returns None:
source_segment = ET.parse(file).getroot().find('body').findall('tu')[0].findall('tuv')[0].find('seg').text
Sample content of a <seg> element that causes the issue:
<seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>
Expected value of the string variable source_segment:
<bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" />
I can't serialize ET.parse(file).getroot().find('body').findall('tu')[0].findall('tuv')[0].find('seg').text because it is a None object. If I serialize only the element itself, ET.parse(file).getroot().find('body').findall('tu')[0].findall('tuv')[0].find('seg'), I get this:
b'<seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>\n '
Sample XML content:
<?xml version="1.0" encoding="utf-8"?>
<tmx version="1.4">
<header creationtool="XXXXXXXX" creationtoolversion="100" o-tmf="XXXXXXXX" datatype="xml" segtype="sentence" adminlang="en-GB" srclang="en-GB" creationdate="XXXXXXXX" creationid="XXXXXXXX">
<prop type="x-Note:SingleString"></prop>
<prop type="x-Recognizers">RecognizeAll</prop>
<prop type="x-IncludesContextContent">True</prop>
<prop type="x-TMName">XXXXXXXX</prop>
<prop type="x-TokenizerFlags">DefaultFlags</prop>
<prop type="x-WordCountFlags">DefaultFlags</prop>
</header>
<body>
<tu creationdate="XXXXXXXX" creationid="XXXXXXXX" changedate="XXXXXXXX" changeid="XXXXXXXX" lastusagedate="XXXXXXXX" usagecount="1">
<prop type="x-LastUsedBy">XXXXXXXX</prop>
<prop type="x-Context">0, 0</prop>
<prop type="x-Origin">TM</prop>
<prop type="x-ConfirmationLevel">Translated</prop>
<prop type="x-StructureContext:MultipleString">sdl:cdata</prop>
<prop type="x-Note:SingleString">XXXXXXXX</prop>
<tuv xml:lang="en-GB">
<seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>
</tuv>
<tuv xml:lang="lt-LT">
<seg><bpt i="1" type="14" x="1" />YYYYYYYYYYYYY<ept i="1" /><ph x="4" type="33" /></seg>
</tuv>
</tu>
</body>
</tmx>
How do I extract the string from a <seg> element when it contains serialized tags?
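The .text attribute is None because it only holds the text that comes before the first child element, and this <seg> starts with <bpt>; "Coded glass plate" is stored as the tail of <bpt> instead. The best approach I found is to convert the <seg> element itself to a string, passing encoding=str so that the result is a str rather than a bytes object and non-ASCII characters are preserved without a separate decoding step. Then use a regex to strip the enclosing <seg> tags from the resulting string.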
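import re
from lxml import etree as ET

root = ET.parse('seg.xml').getroot()
seg_elem = root.find('body').findall('tu')[0].findall('tuv')[0].find('seg')
# encoding=str makes tostring() return a str rather than bytes
seg_string = ET.tostring(seg_elem, encoding=str)
# Regex matching everything between the enclosing <seg> and </seg> tags;
# re.DOTALL lets '.' span line breaks inside a segment
seg_pattern = re.compile(r'(?<=<seg>).*?(?=</seg>)', re.DOTALL)
# Strip the <seg> tags
final_string = seg_pattern.search(seg_string).group()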
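If you would rather avoid the regex, the same inner markup can probably be rebuilt from the element's own parts. The following is only a sketch under a few assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml, it requires Python 3.8+ if you want attribute order preserved, and the helper names inner_xml and iter_segments are made up for illustration. Because the file has millions of TU nodes, it reads them with iterparse and clears each <tu> once it has been processed, rather than loading the whole tree.

```python
import io
import xml.etree.ElementTree as ET


def inner_xml(elem):
    """Return the markup between an element's start and end tags.

    elem.text holds only the text before the first child, which is why it
    is None for a <seg> that begins with <bpt>. The text after each child
    lives in that child's .tail, and ET.tostring() includes the tail.
    """
    parts = [elem.text or ""]
    for child in elem:
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


def iter_segments(source):
    """Yield the inner XML of every <seg>, one <tu> at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tu":
            for seg in elem.iter("seg"):
                yield inner_xml(seg)
            # Drop the processed <tu>'s children and text to limit memory use
            elem.clear()


# Trimmed copy of the sample file from the question
sample = b"""<?xml version="1.0" encoding="utf-8"?>
<tmx version="1.4">
<body>
<tu>
<tuv xml:lang="en-GB">
<seg><bpt i="1" type="14" x="1" />Coded glass plate<ept i="1" /><ph x="4" type="33" /></seg>
</tuv>
<tuv xml:lang="lt-LT">
<seg><bpt i="1" type="14" x="1" />YYYYYYYYYYYYY<ept i="1" /><ph x="4" type="33" /></seg>
</tuv>
</tu>
</body>
</tmx>"""

segments = list(iter_segments(io.BytesIO(sample)))
for segment in segments:
    print(segment)
```

For a real file, pass the file path or an open binary file to iter_segments instead of the BytesIO object. Note that the tags are re-serialized rather than copied, so whitespace inside tags may not match the original file byte for byte.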