
Parsing an XLIFF file using the lxml library


I'm not able to parse this XLIFF fragment:

<source>text1 <g id="1">text2</g> text3 <x id="2"/><x id="3"/>text4</source>

I would like an iterative method that runs over the source tag and fills something like:

parsed_source[0]='text1'
parsed_source[1]='<g id="1">text2</g>'
parsed_source[2]='text3'
parsed_source[3]='<x id="2"/>'
parsed_source[4]='<x id="3"/>'
parsed_source[5]='text4'

That way I can iterate again over the XML fragments [1], [3] and [4] if needed.

Using lxml, for example:

from lxml import etree
tree = etree.iterparse('aFile.xlf')
for action, elem in tree:
    print("%s: %s %s" % (action, elem.tag, elem.text))

I get something similar to:

end: source text1
end: g text2
end: x None
end: x None

But I'm not able to get at text3 and text4. How can I do that? Thanks.


Solution

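  • You need to take the tail property into account. An element's text holds only the text before its first child; text that follows a child's closing tag (such as text3 after </g>) is stored on that child as its tail. Read about it here: https://lxml.de/tutorial.html#elements-contain-text.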

    The following snippet (a slight modification of your code) demonstrates it:

    from lxml import etree
     
    tree = etree.iterparse('aFile.xlf')
    for action, elem in tree:
        print("%s: %s %s %s" % (action, elem.tag, elem.text, elem.tail))
    

    Output:

    end: g text2  text3 
    end: x None None
    end: x None text4
    end: source text1  None
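
Building on text and tail, the list asked for in the question could be assembled roughly as follows. This is only a sketch under some assumptions: `split_mixed_content` is a made-up helper name, the fragment is parsed from a string rather than from 'aFile.xlf', and text runs are stripped of surrounding whitespace, with whitespace-only runs dropped. The import fallback to the standard library is there only so the snippet also runs where lxml is not installed.

```python
import copy

try:
    from lxml import etree
except ImportError:
    # Fallback: the standard library uses the same text/tail model.
    import xml.etree.ElementTree as etree


def split_mixed_content(elem):
    """Return text runs and serialized children of elem, in document order.

    Hypothetical helper, not part of lxml.
    """
    parts = []
    # Text before the first child lives on the parent itself.
    if elem.text and elem.text.strip():
        parts.append(elem.text.strip())
    for child in elem:
        # Serialize a copy without its tail so the trailing text
        # is not duplicated inside the child's markup.
        clone = copy.copy(child)
        clone.tail = None
        parts.append(etree.tostring(clone, encoding="unicode"))
        # Text after the child's closing tag lives on the child's tail.
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return parts


fragment = (
    '<source>text1 <g id="1">text2</g> text3 '
    '<x id="2"/><x id="3"/>text4</source>'
)
parsed_source = split_mixed_content(etree.fromstring(fragment))
for index, part in enumerate(parsed_source):
    print(index, part)
```

With lxml this should give the six entries from the question, and the entries that hold markup can be parsed again with etree.fromstring if needed. With the standard-library fallback, empty elements are written with a space before "/>". In a real XLIFF file the elements are usually namespaced, so the serialized children will likely carry an xmlns declaration as well.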