I have an XML file as below:
<?xml version="1.0"?>
<root>
  <things count="720">
    <tokens>
      <token>
        <fruit>mango</fruit>
      </token>
      <token>
        <fruit>apple</fruit>
      </token>
    </tokens>
    <indices> ... </indices>
  </things>
  <things count="484">
    <tokens>
      <token>
        <fruit>mango</fruit>
      </token>
      <token>
        <plant>coconut</plant>
      </token>
    </tokens>
    <indices> ... </indices>
  </things>
  <things count="455">
    <tokens>
      <token>
        <fruit>mango</fruit>
      </token>
      <token>
        <fruit>apple</fruit>
      </token>
      <token>
        <livingthing>
          coconut
          <subtoken>
            <fruit>cocunut</fruit>
            <fruit>drycocunut</fruit>
          </subtoken>
        </livingthing>
      </token>
    </tokens>
    <indices> ... </indices>
  </things>
  ...
</root>
I want to compare it against a list:
[(('mango', 'FRUIT'), ('coconut', 'PLANT')),
(('mango', 'PLANT'), ('coconut', 'PLANT')),
...
(('apple', 'PLANT'), ('orange', 'FRUIT'), ('coconut', 'PLANT')),
...
(('mango', 'FRUIT'), ('apple', 'FRUIT'), ('coconut', 'LIVING')),
(('apple', 'PLANT'), ('orange', 'LIVING'), ('coconut', 'PLANT')),
...
]
The mapping between the XML tags and the second element of each tuple in a list entry is: fruit -> FRUIT, plant -> PLANT, livingthing -> LIVING.
The goal is to iterate over the <things> elements of the XML one by one and check whether any entry in the list matches. To do this, we map each tag using the mapping above and then compare the texts in sequence. If there is a match, we need to return the order number (position) of the corresponding <things> element in the XML file.
I have tried writing a for loop that iterates over the children of the XML root to locate the relevant tags, with an inner for loop that iterates over the list entries for comparison. Once a match is found, both loops should terminate. As of now, my code only works for some cases, and handling more complex or edge cases makes it either too hard-coded or too complicated.
Hence, a fresh approach to this problem would be welcome.
from lxml import etree
doc = etree.parse(<path_to_xml_file>)
root = doc.getroot()
numThings = len(root.getchildren())
for i in range(numThings):
    toks = root[i]
    numTokens = len(toks.getchildren())
    for j in range(numTokens):
        tok = toks[j]
        numToks = len(tok.getchildren())
        for k in range(numToks):
            t = tok[k]
            numVals = len(t.getchildren())
            if t.tag != 'indices':
                flagMatch = False
                for tupseq in lstTupSeq:
                    for l in range(len(tupseq)):
                        te = tupseq[l]
                        v = t[l]
                        if te[0] == v.text and te[1].lower() in v.tag:
                            flagMatch = True
                        else:
                            flagMatch = False
                            break
                    if flagMatch:
                        print(tupseq, i, j, k)
                        break
The expected output of the comparison is the order number of the match in the XML file. In the above example, it should return 3, because the 3rd <things> element (the one with count="455") matches the list element (('mango', 'FRUIT'), ('apple', 'FRUIT'), ('coconut', 'LIVING')).
Here is a solution; let me know if it helps.
from lxml import etree

doc = etree.parse('scratch.xml')
root = doc.getroot()

compare_list = [
    (('mango', 'FRUIT'), ('coconut', 'PLANT')),
    (('mango', 'PLANT'), ('coconut', 'PLANT')),
    (('apple', 'PLANT'), ('orange', 'FRUIT'), ('coconut', 'PLANT')),
    (('mango', 'FRUIT'), ('apple', 'FRUIT'), ('coconut', 'LIVING')),
    (('apple', 'PLANT'), ('orange', 'LIVING'), ('coconut', 'PLANT')),
]

def func():
    # for each <things> element under <root>
    for child in root:
        l = []
        for node in child:
            # only look inside the <tokens> child of <things>
            if node.tag == 'tokens':
                # for each <token> in <tokens>
                for token in node:
                    # for each element directly inside <token>
                    for item in token:
                        # strip surrounding whitespace (e.g. the text of <livingthing>)
                        text = (item.text or '').strip()
                        # store the text and the mapped tag name in the list
                        if item.tag == 'livingthing':
                            l.append((text, 'LIVING'))
                        else:
                            l.append((text, item.tag.upper()))
        # convert the list into a tuple and check whether compare_list contains it
        if tuple(l) in compare_list:
            # return the count attribute of the matching <things>
            return child.attrib['count']

print(func())
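The output using the XML you provided is:
484
It prints the count attribute of the first <things> element that matches: here the second one, which matches (('mango', 'FRUIT'), ('coconut', 'PLANT')).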
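If you need the order number of the matching <things> element rather than its count attribute, a variation along these lines might work. This is only a sketch under a few assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages (lxml elements also provide iterfind and findall), and the names TAG_TO_LABEL, token_sequence and first_match_position, as well as the shortened sample data, are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag-to-label mapping, inferred from the examples in the question.
TAG_TO_LABEL = {"fruit": "FRUIT", "plant": "PLANT", "livingthing": "LIVING"}


def token_sequence(things):
    """Return the (text, label) tuples for one <things> element."""
    seq = []
    # Only direct children of each <token>, so items nested in <subtoken> are skipped.
    for item in things.findall("./tokens/token/*"):
        label = TAG_TO_LABEL.get(item.tag, item.tag.upper())
        # Strip the whitespace that surrounds multi-line text such as <livingthing>.
        seq.append(((item.text or "").strip(), label))
    return tuple(seq)


def first_match_position(root, candidates):
    """Return the 1-based position of the first matching <things>, or None."""
    wanted = set(candidates)  # tuples are hashable, so membership tests are cheap
    for position, things in enumerate(root.iterfind("things"), start=1):
        if token_sequence(things) in wanted:
            return position
    return None


# Shortened sample data (an assumption, not the full file from the question).
xml = """<root>
  <things count="720">
    <tokens>
      <token><fruit>mango</fruit></token>
      <token><fruit>apple</fruit></token>
    </tokens>
  </things>
  <things count="455">
    <tokens>
      <token><fruit>mango</fruit></token>
      <token><fruit>apple</fruit></token>
      <token><livingthing>
        coconut
        <subtoken><fruit>cocunut</fruit></subtoken>
      </livingthing></token>
    </tokens>
  </things>
</root>"""

candidates = [
    (("mango", "FRUIT"), ("coconut", "PLANT")),
    (("mango", "FRUIT"), ("apple", "FRUIT"), ("coconut", "LIVING")),
]
print(first_match_position(ET.fromstring(xml), candidates))  # prints 2
```

With this shortened sample the second <things> element is the first match, so the script prints 2. Keeping the tag mapping in a dict means a new tag only needs a new entry, not another if branch.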