Tags: python, xml, list, for-loop, comparison

Comparing XML node values inside a parent tag to sequences of tuples stored in a list


I have an XML file as below:

<?xml version="1.0"?>
<root>
<things count="720">
 <tokens>
  <token>
   <fruit>mango</fruit>
  </token>
  <token>
   <fruit>apple</fruit>
  </token>
 </tokens>
 <indices> ... </indices>
</things>

<things count="484">
 <tokens>
  <token>
   <fruit>mango</fruit>
  </token>
  <token>
   <plant>coconut</plant>
  </token>
 </tokens>
 <indices> ... </indices>
</things>

<things count="455">
 <tokens>
  <token>
   <fruit>mango</fruit>
  </token>
  <token>
   <fruit>apple</fruit>
  </token>
  <token>
   <livingthing>
    coconut
    <subtoken>
     <fruit>cocunut</fruit>
     <fruit>drycocunut</fruit>
    </subtoken>
   </livingthing>
  </token>
 </tokens>
 <indices> ... </indices>
</things>

...

</root>

I want to compare it against a list:

[(('mango', 'FRUIT'), ('coconut', 'PLANT')),
 (('mango', 'PLANT'), ('coconut', 'PLANT')),
 ...
 (('apple', 'PLANT'), ('orange', 'FRUIT'), ('coconut', 'PLANT')),
 ...
 (('mango', 'FRUIT'), ('apple', 'FRUIT'), ('coconut', 'LIVING')),
 (('apple', 'PLANT'), ('orange', 'LIVING'), ('coconut', 'PLANT')), 
 ...
]

The mapping between the XML tag names and the second element of each tuple in a list element is:

  • fruit --> FRUIT
  • plant --> PLANT
  • living --> LIVING
  • livingthing --> LIVING
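
One way to write this mapping down in Python is a plain dict. This is only a sketch: the names TAG_TO_LABEL and label_for are hypothetical and do not appear in the code further down.

```python
# Hypothetical names; the dict mirrors the four mapping rules listed above.
TAG_TO_LABEL = {
    "fruit": "FRUIT",
    "plant": "PLANT",
    "living": "LIVING",
    "livingthing": "LIVING",
}


def label_for(tag):
    """Return the list label for an XML tag, or None if the tag is unmapped."""
    return TAG_TO_LABEL.get(tag)


print(label_for("livingthing"))  # LIVING
```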

The goal is to iterate over the things elements in the XML one by one and check whether any of them matches an element of the list. To do this, each tag is translated with the mapping above, and then the texts and mapped tags are compared pairwise, in order. If there is a match, the position (order number) of the matching things element in the XML file should be returned.

I tried writing a for-loop that iterates over the XML elements (children) to locate the relevant tags, with an inner for-loop that iterates over the list elements for comparison. Once a match is found, both loops should terminate. So far my code works only for some cases, and handling the more complex or edge cases makes it either too hard-coded or too complicated.

Hence, a fresh approach to this problem would be welcome.

from lxml import etree 
doc = etree.parse(<path_to_xml_file>)
root = doc.getroot()

numThings= len(root.getchildren())

for i in range(numThings):
    toks = root[i]

    numTokens = len(toks.getchildren())
    for j in range(numTokens):

        tok = toks[j]
        numToks = len(tok.getchildren())

        for k in range(numToks):
            t = tok[k]
            numVals = len(t.getchildren())
            if t.tag != 'indices':

                flagMatch = False
                for tupseq in lstTupSeq:
                    for l in range(len(tupseq)):
                        te = tupseq[l]

                        v = t[l]
                        if te[0] == v.text and te[1].lower() in v.tag:
                            flagMatch = True
                        else:
                            flagMatch = False
                            break;
                    if flagMatch:
                        print(tupseq, i, j, k)
                        break;

The expected output is the position of the matching things element in the XML file. In the example above, the result should be 3, because the 3rd things element (the one with count="455") matches the list element (('mango', 'FRUIT'), ('apple', 'FRUIT'), ('coconut', 'LIVING')).


Solution

  • Here is a solution; let me know if it helps.

    from lxml import etree
    
    doc = etree.parse('scratch.xml')
    root = doc.getroot()
    things = {}
    compare_list = [
        (('mango', 'FRUIT'), ('coconut', 'PLANT')),
        (('mango', 'PLANT'), ('coconut', 'PLANT')),
        (('apple', 'PLANT'), ('orange', 'FRUIT'), ('coconut', 'PLANT')),
        (('mango', 'FRUIT'), ('apple', 'FRUIT'), ('coconut', 'LIVING')),
        (('apple', 'PLANT'), ('orange', 'LIVING'), ('coconut', 'PLANT')),
    ]
    
    def func():
        # for each <things> element (iterate the root directly; getchildren() is deprecated)
        for child in root:
            l = []
            for node in child:
    
                # if the node tag inside <things> child is 'tokens'
                if node.tag == 'tokens':
    
                    # for each 'token' in 'tokens'
                    for token in node:
    
                        # for each tag inside 'token'
                        for item in token:
    
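                            # store the text and the mapped tag name in the list;
                            # strip the text, since it can be padded with whitespace
                            # (e.g. 'coconut' inside <livingthing>)
                            text = (item.text or '').strip()
                            if item.tag == 'livingthing':
                                l.append((text, 'LIVING'))
                            else:
                                l.append((text, item.tag.upper()))
    
                            # convert the list into a tuple and check whether the same tuple is in compare_list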
                            if tuple(l) in compare_list:
                                # return things count if found
                                return child.attrib['count']
    
    print(func())
    

    The output using the XML you provided is:

    484
    

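    It prints the count attribute of the first things element that matches an entry in compare_list (here the second element in the file), not its position in the file.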
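
If the position of the match is wanted rather than the count attribute, a variant along the following lines may work. It is a sketch under a few assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, it embeds a trimmed copy of the sample XML, it compares the whole token sequence of a things element rather than a prefix, and the names TAG_TO_LABEL, things_to_sequence and first_match_position are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample XML from the question (the <indices> parts are left out).
XML = """<root>
<things count="720">
 <tokens>
  <token><fruit>mango</fruit></token>
  <token><fruit>apple</fruit></token>
 </tokens>
</things>
<things count="484">
 <tokens>
  <token><fruit>mango</fruit></token>
  <token><plant>coconut</plant></token>
 </tokens>
</things>
<things count="455">
 <tokens>
  <token><fruit>mango</fruit></token>
  <token><fruit>apple</fruit></token>
  <token>
   <livingthing>
    coconut
    <subtoken>
     <fruit>cocunut</fruit>
     <fruit>drycocunut</fruit>
    </subtoken>
   </livingthing>
  </token>
 </tokens>
</things>
</root>"""

# Hypothetical name; mirrors the tag mapping given in the question.
TAG_TO_LABEL = {"fruit": "FRUIT", "plant": "PLANT",
                "living": "LIVING", "livingthing": "LIVING"}


def things_to_sequence(things):
    """Flatten one <things> element into a tuple of (text, label) pairs."""
    pairs = []
    # Only direct children of each <token> are used; nested <subtoken> items are ignored.
    for item in things.findall("./tokens/token/*"):
        text = (item.text or "").strip()  # text may be padded with whitespace
        label = TAG_TO_LABEL.get(item.tag, item.tag.upper())
        pairs.append((text, label))
    return tuple(pairs)


def first_match_position(root, candidates):
    """Return the 1-based position of the first matching <things>, or None."""
    wanted = set(candidates)  # tuples are hashable, so membership tests are cheap
    for position, things in enumerate(root.findall("things"), start=1):
        if things_to_sequence(things) in wanted:
            return position
    return None


root = ET.fromstring(XML)
target = [(("mango", "FRUIT"), ("apple", "FRUIT"), ("coconut", "LIVING"))]
print(first_match_position(root, target))  # 3
```

With the full compare_list from the code above, the same function would return 2, because the second things element already matches the first list entry.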