
How to find a specific tag in an XML file and then access its parent tag with Python and minidom


I'm trying to write some code that will search through an XML file of articles for a particular DOI contained within a tag. When it has found the correct DOI I'd like it to then access the <title> and <abstract> text for the article associated with that DOI.

My XML file is in this format:

<root>
 <article>
  <number>
   0 
  </number>
  <DOI>
   10.1016/B978-0-12-381015-1.00004-6 
  </DOI>
  <title>
   The patagonian toothfish biology, ecology and fishery. 
  </title>
  <abstract>
   lots of abstract text
  </abstract>
 </article>
 <article>
  ...All the article tags as shown above...
 </article>
</root>

I'd like the script to find the article with the DOI 10.1016/B978-0-12-381015-1.00004-6 (for example) and then let me access the <title> and <abstract> tags within the corresponding <article> tag.

So far I've tried to adapt code from this question:

from xml.dom import minidom

datasource = open('/Users/philgw/Dropbox/PW-Honours-Project/Code/processed.xml')
xmldoc = minidom.parse(datasource)   

#looking for: 10.1016/B978-0-12-381015-1.00004-6

matchingNodes = [node for node in xmldoc.getElementsByTagName("DOI") if node.firstChild.nodeValue == '10.1016/B978-0-12-381015-1.00004-6']

for i in range(len(matchingNodes)):
    DOI = str(matchingNodes[i])
    print DOI

But I'm not entirely sure what I'm doing!

Thanks for any help.


Solution

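  • IMHO the Python docs for xml.dom.minidom cover this. Rather than collecting the <DOI> nodes, collect the <article> nodes whose DOI matches, then read the title and abstract from each match. Try this (not tested):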

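    from xml.dom import minidom

    # Same file as in the question; minidom.parse also accepts a filename
    datasource = open('/Users/philgw/Dropbox/PW-Honours-Project/Code/processed.xml')
    xmldoc = minidom.parse(datasource)

    def get_xmltext(parent, subnode_name):
        # Content of the first matching descendant, or "" if the tag is missing
        nodes = parent.getElementsByTagName(subnode_name)
        if not nodes:
            return ""
        # strip() drops the whitespace that surrounds the text in the sample XML
        return "".join([ch.toxml() for ch in nodes[0].childNodes]).strip()

    matchingNodes = [node for node in xmldoc.getElementsByTagName("article")
               if get_xmltext(node, "DOI") == '10.1016/B978-0-12-381015-1.00004-6']

    for node in matchingNodes:
        print("title: " + get_xmltext(node, "title"))
        print("abstract: " + get_xmltext(node, "abstract"))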
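
Since the title asks about reaching the parent tag, here is a minimal sketch of that route: find the <DOI> element first, then step up with parentNode. It assumes each <DOI> is a direct child of its <article>, as in the sample. SAMPLE_XML, the second article's DOI, and the helper names node_text, find_article_by_doi and child_text are made up for illustration. The text is stripped before comparing because the sample pads each value with whitespace and newlines.

```python
from xml.dom import minidom

# Hypothetical sample data mirroring the layout shown in the question
SAMPLE_XML = """<root>
 <article>
  <number>
   0
  </number>
  <DOI>
   10.1016/B978-0-12-381015-1.00004-6
  </DOI>
  <title>
   The patagonian toothfish biology, ecology and fishery.
  </title>
  <abstract>
   lots of abstract text
  </abstract>
 </article>
 <article>
  <DOI>
   10.0000/another-doi
  </DOI>
  <title>
   Another title
  </title>
  <abstract>
   another abstract
  </abstract>
 </article>
</root>"""


def node_text(node):
    """Concatenate the text nodes directly under `node` and trim whitespace."""
    return "".join(
        child.data for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    ).strip()


def find_article_by_doi(doc, doi):
    """Return the <article> element whose <DOI> matches `doi`, or None."""
    for doi_node in doc.getElementsByTagName("DOI"):
        if node_text(doi_node) == doi:
            return doi_node.parentNode  # step up from <DOI> to <article>
    return None


def child_text(parent, tag_name):
    """Return the trimmed text of the first `tag_name` descendant, or ""."""
    nodes = parent.getElementsByTagName(tag_name)
    return node_text(nodes[0]) if nodes else ""


xmldoc = minidom.parseString(SAMPLE_XML)
article = find_article_by_doi(xmldoc, "10.1016/B978-0-12-381015-1.00004-6")
if article is not None:
    print("title: " + child_text(article, "title"))
    print("abstract: " + child_text(article, "abstract"))
```

To run this against the real file, swap minidom.parseString(SAMPLE_XML) for minidom.parse with the file path. If several articles could share a DOI, collect the parents in a list instead of returning the first match.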