I have an XML file (actually an XLIFF file) where a node has two child nodes with identical substructure (which is not known a priori, can be very complex, and changes for each <trans-unit>). I'm working with Python and the lxml library. Example:
<trans-unit id="tu4" xml:space="preserve">
<seg-source>
<mrk mid="0" mtype="seg">
<g id="1">...</g>
<g id="2">...</g>
<g id="3">...</g>
<bx id="7"/>...
</mrk>
<mrk mid="1" mtype="seg">...</mrk>
<mrk mid="2" mtype="seg">...
<ex id="7"/>
<g id="8"> FROM HERE </g>
</mrk>
</seg-source>
<target xml:lang="en">
<mrk mid="0" mtype="seg">
<g id="1">...</g>
<g id="2">...</g>
<g id="3">...</g>
<bx id="7"/>...
</mrk>
<mrk mid="1" mtype="seg">...</mrk>
<mrk mid="2" mtype="seg">...
<ex id="7"/>
<g id="8"> TO HERE </g>
</mrk>
</target>
</trans-unit>
As you can see, the two nodes <seg-source> and <target> have exactly the same substructure. My goal is to navigate to each node of <seg-source>, get the text and the tail of that node (I know how to do that with XPath), translate them, and finally (THIS IS what I don't know how to do) assign the translation to the corresponding node in <target>.
In other words: suppose I have the node "FROM HERE". How can I get the node "TO HERE"?
If you want to pair them all, you could just zip the nodes together so you can access the matching nodes from each:
from lxml import etree

# x is the XML string from the question
tree = etree.fromstring(x)
nodes = iter(tree.xpath("//*[self::seg-source or self::target]"))
# zipping the iterator with itself yields consecutive (seg-source, target) pairs
for seq, tar in zip(nodes, nodes):
    # seq and tar are the matching seg-source and target nodes
    print(seq.xpath(".//*"))
    print(tar.xpath(".//*"))
Since there are only two children in each trans-unit, you can just use nodes = iter(tree.xpath("//trans-unit/*")), so the names of the nodes inside don't matter:
nodes = iter(tree.xpath("//trans-unit/*"))
for seq, tar in zip(nodes, nodes):
    print(seq.xpath(".//*"))
    print(tar.xpath(".//*"))
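One caveat with the unprefixed paths above: a real XLIFF 1.2 file normally declares a default namespace, and then "//trans-unit/*" matches nothing. Here is a minimal sketch of a namespace-aware variant. The sample document, the prefix "x" and the choice to select the two children by name are assumptions for illustration, not something taken from your file:

```python
from lxml import etree

# Assumed XLIFF 1.2 sample; the default namespace is what breaks "//trans-unit/*".
xliff = b"""<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="demo" source-language="it" target-language="en" datatype="plaintext">
    <body>
      <trans-unit id="tu1">
        <seg-source><mrk mid="0" mtype="seg">ciao</mrk></seg-source>
        <target xml:lang="en"><mrk mid="0" mtype="seg">ciao</mrk></target>
      </trans-unit>
    </body>
  </file>
</xliff>"""

root = etree.fromstring(xliff)
ns = {"x": "urn:oasis:names:tc:xliff:document:1.2"}  # "x" is an arbitrary prefix

pairs = []
for unit in root.xpath("//x:trans-unit", namespaces=ns):
    # Select the two children by name rather than by position, so an extra
    # child such as <source> or <note> does not shift the pairing.
    seg_source = unit.find("x:seg-source", namespaces=ns)
    target = unit.find("x:target", namespaces=ns)
    if seg_source is not None and target is not None:
        pairs.append((seg_source, target))

print(len(pairs))
```

If your file has no namespace declaration, the plain paths work as written and this is not needed.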
If we run the //trans-unit/* version on your sample and print the text of the g node with id 8 from each side, you can see the output gets one from each:
In [2]: from lxml import etree
In [3]: tree = etree.fromstring(x)
In [4]: nodes = iter(tree.xpath("//trans-unit/*"))
In [5]: for seq, tar in zip(nodes, nodes):
   ...:     print(seq.xpath(".//g[@id='8']/text()"))
   ...:     print(tar.xpath(".//g[@id='8']/text()"))
   ...:
[' FROM HERE ']
[' TO HERE ']
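For the literal question (you already hold the "FROM HERE" node and want the "TO HERE" one), one option is to look the node up by its document-order position. This is only a sketch: counterpart is a hypothetical helper, the XML is a trimmed version of your sample, and it assumes both subtrees really do have the same structure:

```python
from lxml import etree

# Trimmed version of the sample from the question
xml = """<trans-unit id="tu4">
<seg-source><mrk mid="0"><g id="1">a</g><bx id="7"/></mrk><mrk mid="2"><ex id="7"/><g id="8"> FROM HERE </g></mrk></seg-source>
<target xml:lang="en"><mrk mid="0"><g id="1">a</g><bx id="7"/></mrk><mrk mid="2"><ex id="7"/><g id="8"> TO HERE </g></mrk></target>
</trans-unit>"""

unit = etree.fromstring(xml)
seg_source, target = unit.find("seg-source"), unit.find("target")


def counterpart(node, src_root, dst_root):
    """Return the element of dst_root that sits at the same document-order
    position as node does inside src_root (hypothetical helper)."""
    src_nodes = src_root.xpath(".//*")
    dst_nodes = dst_root.xpath(".//*")
    return dst_nodes[src_nodes.index(node)]


from_here = seg_source.xpath(".//g[@id='8']")[0]
to_here = counterpart(from_here, seg_source, target)
print(to_here.text)
```

This rebuilds both node lists on every call, so for a whole trans-unit the zip approach is cheaper; the helper is only convenient for one-off lookups.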
Each node is the corresponding node from each child of trans-unit (nodes has to be recreated first, since the loop above exhausted the iterator):
In [7]: for seq, tar in zip(nodes, nodes):
   ...:     print(seq.tag, tar.tag)
   ...:     for n1, n2 in zip(seq.xpath(".//*"), tar.xpath(".//*")):
   ...:         print(n1.tag, n2.tag)
   ...:
('seg-source', 'target')
('mrk', 'mrk')
('g', 'g')
('g', 'g')
('g', 'g')
('bx', 'bx')
('mrk', 'mrk')
('mrk', 'mrk')
('ex', 'ex')
('g', 'g')
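Putting it together for what you actually want to do, here is a rough sketch that walks both subtrees in parallel and writes the translated text and tail into the target. Both translate and fill_target are hypothetical names, translate is just a stand-in (it uppercases the text), and the tag comparison is an extra safety check I am assuming you would want:

```python
from lxml import etree


def translate(text):
    """Placeholder for a real translation call (hypothetical)."""
    return text.upper()


def fill_target(seg_source, target):
    """Copy translated text and tail from each seg-source descendant
    to the descendant at the same position in target."""
    src_nodes = seg_source.xpath(".//*")
    tgt_nodes = target.xpath(".//*")
    # zip() silently truncates, so check that the structures really match
    if [n.tag for n in src_nodes] != [n.tag for n in tgt_nodes]:
        raise ValueError("seg-source and target differ in structure")
    for src, tgt in zip(src_nodes, tgt_nodes):
        # text comes before the first child, tail comes after the closing tag;
        # None (no text at all) is copied through untouched
        tgt.text = translate(src.text) if src.text else src.text
        tgt.tail = translate(src.tail) if src.tail else src.tail


xml = (
    '<trans-unit id="tu4">'
    '<seg-source><mrk mid="0" mtype="seg"><g id="1">uno</g> due<bx id="7"/></mrk></seg-source>'
    '<target xml:lang="en"><mrk mid="0" mtype="seg"><g id="1">uno</g> due<bx id="7"/></mrk></target>'
    '</trans-unit>'
)
unit = etree.fromstring(xml)
target = unit.find("target")
fill_target(unit.find("seg-source"), target)
print(etree.tostring(target, encoding="unicode"))
```

Note that ".//*" only covers descendants, so text sitting directly inside <seg-source> itself (before its first child) is not copied here; handle that separately if your files have it.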