I'm trying to write some code that will search through an XML file of articles for a particular DOI contained within a tag. When it has found the correct DOI, I'd like it to then access the <title> and <abstract> text for the article associated with that DOI.
My XML file is in this format:
<root>
  <article>
    <number>
      0
    </number>
    <DOI>
      10.1016/B978-0-12-381015-1.00004-6
    </DOI>
    <title>
      The patagonian toothfish biology, ecology and fishery.
    </title>
    <abstract>
      lots of abstract text
    </abstract>
  </article>
  <article>
    ...All the article tags as shown above...
  </article>
</root>
I'd like the script to find the article with the DOI 10.1016/B978-0-12-381015-1.00004-6 (for example) and then for me to be able to access the <title> and <abstract> tags within the corresponding <article> tag.
So far I've tried to adapt code from this question:
from xml.dom import minidom
datasource = open('/Users/philgw/Dropbox/PW-Honours-Project/Code/processed.xml')
xmldoc = minidom.parse(datasource)
#looking for: 10.1016/B978-0-12-381015-1.00004-6
matchingNodes = [node for node in xmldoc.getElementsByTagName("DOI") if node.firstChild.nodeValue == '10.1016/B978-0-12-381015-1.00004-6']
for i in range(len(matchingNodes)):
    DOI = str(matchingNodes[i])
    print DOI
But I'm not entirely sure what I'm doing!
Thanks for any help.
IMHO, just look it up in the Python docs! Try this (not tested):
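from xml.dom import minidom

# Same file as in the question
datasource = open('/Users/philgw/Dropbox/PW-Honours-Project/Code/processed.xml')
xmldoc = minidom.parse(datasource)

def get_xmltext(parent, subnode_name):
    # Text of the first <subnode_name> under parent, with surrounding
    # whitespace and newlines stripped so comparisons work
    node = parent.getElementsByTagName(subnode_name)[0]
    return "".join([ch.toxml() for ch in node.childNodes]).strip()

# Match on <article> (not <DOI>) so title and abstract are reachable from each hit
matchingNodes = [node for node in xmldoc.getElementsByTagName("article")
                 if get_xmltext(node, "DOI") == '10.1016/B978-0-12-381015-1.00004-6']

for node in matchingNodes:
    print "title:", get_xmltext(node, "title")
    print "abstract:", get_xmltext(node, "abstract")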
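If you're not tied to minidom, the same lookup can also be sketched with xml.etree.ElementTree from the standard library. This is a different technique from the code above, not a fix to it. The sample string, the second article and its DOI, and the find_article helper are all made up for illustration; with a real file you would get the root from ET.parse(path).getroot() instead of ET.fromstring. The .strip() calls are there on the assumption that the text may sit on its own line inside each tag, as in the question's sample.

```python
import xml.etree.ElementTree as ET

# Sample data shaped like the question's file. The second article is
# hypothetical and only exists to show that non-matching entries are skipped.
SAMPLE_XML = """<root>
  <article>
    <number>
    0
    </number>
    <DOI>
    10.1016/B978-0-12-381015-1.00004-6
    </DOI>
    <title>
    The patagonian toothfish biology, ecology and fishery.
    </title>
    <abstract>
    lots of abstract text
    </abstract>
  </article>
  <article>
    <number>1</number>
    <DOI>10.0000/hypothetical-doi</DOI>
    <title>A made-up second article</title>
    <abstract>more abstract text</abstract>
  </article>
</root>"""


def find_article(root, doi):
    """Return (title, abstract) for the first article whose DOI matches, else None."""
    for article in root.iter("article"):
        # findtext returns None when the tag is missing, so fall back to ""
        if (article.findtext("DOI") or "").strip() == doi:
            title = (article.findtext("title") or "").strip()
            abstract = (article.findtext("abstract") or "").strip()
            return title, abstract
    return None


root = ET.fromstring(SAMPLE_XML)
result = find_article(root, "10.1016/B978-0-12-381015-1.00004-6")
print(result)
```

Returning None for a DOI that isn't in the file lets the caller decide what to do about a miss, rather than hitting an IndexError the way a bare [0] on an empty list would.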