Search code examples
pythonxml-parsingminidom

Get tag value using minidom in dynamic tree


I have a xml file which doesn't have the same tags every time in some deep levels.

For example, this is a part of the xml structure, where <openie> is located in

root > document > sentences > sentence > openie

and I want to get values from <text> tag for each sentence

<openie>
  <triple confidence="1.000">
    <subject begin="1" end="2">
      <text>customer</text>
      <lemma>customer</lemma>
    </subject>
    <relation begin="2" end="3">
      <text>enters</text>
      <lemma>enter</lemma>
    </relation>
    <object begin="3" end="6">
      <text>their order number</text>
      <lemma>they order number</lemma>
    </object>
  </triple>
</openie>

I have started with this approach but I got stuck at the point where the XML has different tags, ie. subject, relation and object. And the structure in each openie can change, for example there can be some other tag along with these three I mentioned and it also can have <text> tag.

from xml.dom import minidom

def parse_xml():
    xmldoc = minidom.parse('./tmp/nlp_output.xml')
    sentencesNode = xmldoc.getElementsByTagName('sentences')
    for sentenceNode in sentencesNode:
        for openIeNode in sentenceNode.childNodes:
            for tripleNode in openIeNode.childNodes:
              #what now?

Solution

  • In the context of your problem need, which is

    • I want to get values from < text > tag for each sentence ?

    There is no need to keep track of different tags or child-nodes every time. Here is simple workaround:

    from xml.dom import minidom
    xml_doc = minidom.parse('./tmp/nlp_output.xml')
    
    #  To get Number of available tags, you want to search :
    item_list = xml_doc.getElementsByTagName('text')
    print("Number of text-tags:", len(item_list), '\n')
    
    for text_Elem in item_list:
        text_value = ''.join([node.data for node in text_Elem.childNodes])
        print('Required Value:', text_value)
    

    By using this technique, you'll get Exact tag value as required in your case. To learn in detail about XML-parsing visit reference: How-to-Parse-XML-in-Python.


    Here is output for given XML-File i.e nlp_output.xml,

     - Number of text-tags: 3 
    
     - Required Value: customer
    
     - Required Value: enters 
    
     - Required Value: their order number