python xml xpath lxml xliff

Get the corresponding XML nodes with xpath


I have an XML file (actually an XLIFF file) in which a node has two child nodes with identical substructure. That substructure is not known a priori, can be very complex, and changes for each <trans-unit>. I'm working with Python and the lxml library. Example:

<trans-unit id="tu4" xml:space="preserve">
    <seg-source>
        <mrk mid="0" mtype="seg">
            <g id="1">...</g>
            <g id="2">...</g>
            <g id="3">...</g>
            <bx id="7"/>...
        </mrk>
        <mrk mid="1" mtype="seg">...</mrk>
        <mrk mid="2" mtype="seg">...
            <ex id="7"/>
            <g id="8"> FROM HERE </g>
        </mrk>
   </seg-source>
   <target xml:lang="en">
        <mrk mid="0" mtype="seg">
            <g id="1">...</g>
            <g id="2">...</g>
            <g id="3">...</g>
            <bx id="7"/>...
        </mrk>
        <mrk mid="1" mtype="seg">...</mrk>
        <mrk mid="2" mtype="seg">...
            <ex id="7"/>
            <g id="8"> TO HERE </g>
        </mrk>
   </target>
</trans-unit>

As you can see, the two nodes <seg-source> and <target> have exactly the same substructure. My goal is to visit each node of <seg-source>, get the text and the tail of that node (I know how to do that with XPath), translate them, and finally assign the translation to the corresponding node in <target>. That last step is what I don't know how to do.

In other words: suppose I have the node "FROM HERE". How can I get the node "TO HERE"?


Solution

  • If you want to pair them all, you can zip the nodes together so that you can access the matching nodes from each side:

    from lxml import etree
    
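    tree = etree.fromstring(x)  # x holds the XML shown in the question
    # zip(nodes, nodes) draws from one iterator twice, so consecutive
    # results are paired: (seg-source, target), in document order
    nodes = iter(tree.xpath("//*[self::seg-source or self::target]"))
    for seq, tar in zip(nodes, nodes):
        # seq and tar are the seg-source and target of the same trans-unit
        print(seq.xpath(".//*"))
        print(tar.xpath(".//*"))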
    

    Since each trans-unit here has only these two children, you can instead use nodes = iter(tree.xpath("//trans-unit/*")), so the names of the child nodes don't matter.

    nodes = iter(tree.xpath("//trans-unit/*"))
    for seq, tar in zip(nodes, nodes):
        print(seq.xpath(".//*"))
        print(tar.xpath(".//*"))
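Selecting every child of trans-unit relies on each unit holding exactly the two containers. As a hedged alternative, the sketch below looks the two up by name inside each unit, so an extra child such as <source> does not shift the pairing. The sample document is hypothetical, and it assumes the tags carry no namespace, as in the question; real XLIFF files often declare a default namespace, in which case the names would have to be namespace-qualified.

```python
# lxml is what the answer uses; the calls below (find, iter, get) are also
# available in the standard library's ElementTree, hence the fallback.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical sample: two trans-units, each with an extra <source> child
xml = """<body>
  <trans-unit id="tu1">
    <source>Hello</source>
    <seg-source><mrk mid="0" mtype="seg">Hello</mrk></seg-source>
    <target><mrk mid="0" mtype="seg">Hello</mrk></target>
  </trans-unit>
  <trans-unit id="tu2">
    <source>Bye</source>
    <seg-source><mrk mid="0" mtype="seg">Bye</mrk></seg-source>
    <target><mrk mid="0" mtype="seg">Bye</mrk></target>
  </trans-unit>
</body>"""

root = etree.fromstring(xml)

pairs = []
for unit in root.iter("trans-unit"):
    # Look the two containers up by name instead of by position
    seg = unit.find("seg-source")
    tar = unit.find("target")
    if seg is None or tar is None:
        continue  # skip units that lack either side
    pairs.append((unit.get("id"), seg, tar))

for unit_id, seg, tar in pairs:
    print(unit_id, seg.tag, tar.tag)
```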
    

    Running the code on your sample and printing the text of the g node with id 8 from each side shows that the two are paired:

    In [2]: from lxml import etree
    
    In [3]: tree = etree.fromstring(x)
    
    In [4]: nodes = iter(tree.xpath("//trans-unit/*"))
    
    In [5]: for seq, tar in zip(nodes, nodes):
       ...:         print(seq.xpath(".//g[@id='8']/text()"))
       ...:         print(tar.xpath(".//g[@id='8']/text()"))
       ...:     
    [' FROM HERE ']
    [' TO HERE ']
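The session above finds both nodes by a known id. When only the source node is at hand, one possible approach is to look up its position among the descendants of seg-source and take the element at the same position under target. The sketch below assumes both sides really have identical structure; `counterpart` is a hypothetical helper name, not part of lxml, and the XML is a shortened version of the sample.

```python
# lxml is what the answer uses; find and iter also exist in the standard
# library's ElementTree, hence the fallback.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

xml = """<trans-unit id="tu4">
  <seg-source>
    <mrk mid="0" mtype="seg"><g id="1">a</g><bx id="7"/></mrk>
    <mrk mid="2" mtype="seg"><ex id="7"/><g id="8"> FROM HERE </g></mrk>
  </seg-source>
  <target>
    <mrk mid="0" mtype="seg"><g id="1">a</g><bx id="7"/></mrk>
    <mrk mid="2" mtype="seg"><ex id="7"/><g id="8"> TO HERE </g></mrk>
  </target>
</trans-unit>"""


def counterpart(node, src_root, tgt_root):
    """Return the element under tgt_root that sits at the same
    document-order position as node does under src_root."""
    src_nodes = list(src_root.iter("*"))
    tgt_nodes = list(tgt_root.iter("*"))
    for i, candidate in enumerate(src_nodes):
        if candidate is node:
            return tgt_nodes[i]
    raise ValueError("node is not inside src_root")


root = etree.fromstring(xml)
seg = root.find("seg-source")
tar = root.find("target")

from_here = seg.find(".//g[@id='8']")
to_here = counterpart(from_here, seg, tar)
print(repr(to_here.text))
```

Rebuilding both lists on every call is wasteful for large files; pairing everything once with zip, as in the answer, is the cheaper route when all nodes are needed.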
    

    Zipping the descendants of the two children pairs every node with its counterpart, because both XPath results come back in document order:

    In [6]: nodes = iter(tree.xpath("//trans-unit/*"))  # the iterator from In [4] is exhausted
    
    In [7]: for seq, tar in zip(nodes, nodes):
       ...:         print(seq.tag, tar.tag)
       ...:         for n1, n2 in zip(seq.xpath(".//*"),tar.xpath(".//*")):
       ...:                 print(n1.tag, n2.tag)
       ...:         
    ('seg-source', 'target')
    ('mrk', 'mrk')
    ('g', 'g')
    ('g', 'g')
    ('g', 'g')
    ('bx', 'bx')
    ('mrk', 'mrk')
    ('mrk', 'mrk')
    ('ex', 'ex')
    ('g', 'g')
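With the nodes paired this way, writing the translations back could look roughly like the sketch below. It is only an illustration: `translate` is a hypothetical stand-in (it just upper-cases the text), the XML is a shortened sample, and the tag comparison is an extra guard that the answer itself does not include.

```python
# lxml is what the answer uses; find, iter and tostring also exist in the
# standard library's ElementTree, hence the fallback.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

xml = """<trans-unit id="tu4">
  <seg-source><mrk mid="0" mtype="seg"><g id="1">one</g> two</mrk></seg-source>
  <target><mrk mid="0" mtype="seg"><g id="1">one</g> two</mrk></target>
</trans-unit>"""


def translate(text):
    """Hypothetical placeholder for a real translation call."""
    if text is None or not text.strip():
        return text  # leave missing or whitespace-only text untouched
    return text.upper()


root = etree.fromstring(xml)
seg = root.find("seg-source")
tar = root.find("target")

# iter("*") yields the container first, then its descendants in document
# order; drop the containers themselves with [1:]
src_nodes = list(seg.iter("*"))[1:]
tgt_nodes = list(tar.iter("*"))[1:]

# Guard: refuse to copy if the two sides do not line up tag by tag
if [n.tag for n in src_nodes] != [n.tag for n in tgt_nodes]:
    raise ValueError("seg-source and target do not share the same structure")

for src, tgt in zip(src_nodes, tgt_nodes):
    tgt.text = translate(src.text)
    tgt.tail = translate(src.tail)

print(etree.tostring(tar, encoding="unicode"))
```

Skipping the containers means any text placed directly inside seg-source, outside a mrk element, is not copied; whether that matters depends on the files at hand.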