I have an XML file whose tags are not the same every time at some of the deeper levels.
For example, this is part of the XML structure, where <openie> is located at root > document > sentences > sentence > openie, and I want to get the values of the <text> tag for each sentence:
<openie>
  <triple confidence="1.000">
    <subject begin="1" end="2">
      <text>customer</text>
      <lemma>customer</lemma>
    </subject>
    <relation begin="2" end="3">
      <text>enters</text>
      <lemma>enter</lemma>
    </relation>
    <object begin="3" end="6">
      <text>their order number</text>
      <lemma>they order number</lemma>
    </object>
  </triple>
</openie>
I started with the approach below, but I got stuck at the point where the XML has different tags, i.e. subject, relation and object. The structure inside each openie can also change: for example, there can be some other tag alongside the three I mentioned, and it can have a <text> tag as well.
from xml.dom import minidom

def parse_xml():
    xmldoc = minidom.parse('./tmp/nlp_output.xml')
    sentencesNode = xmldoc.getElementsByTagName('sentences')
    for sentenceNode in sentencesNode:
        for openIeNode in sentenceNode.childNodes:
            for tripleNode in openIeNode.childNodes:
                # what now?
For what you need, which is to get the values of the <text> tag, there is no need to keep track of the different tags or child nodes every time. Here is a simple workaround:
from xml.dom import minidom

xml_doc = minidom.parse('./tmp/nlp_output.xml')

# Collect every <text> element, no matter how deeply it is nested
item_list = xml_doc.getElementsByTagName('text')
print("Number of text-tags:", len(item_list), '\n')

for text_Elem in item_list:
    # Join the data of the child text nodes to get the tag's value
    text_value = ''.join([node.data for node in text_Elem.childNodes])
    print('Required Value:', text_value)
With this technique you get exactly the tag values you need, regardless of which parent tags surround them. To learn more about XML parsing, see the reference: How-to-Parse-XML-in-Python.
Here is the output for the given XML file, i.e. nlp_output.xml:
- Number of text-tags: 3
- Required Value: customer
- Required Value: enters
- Required Value: their order number
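The flat getElementsByTagName('text') lookup returns every <text> value in the document in one list, so it loses track of which sentence and which triple each value came from. If you need the values grouped per sentence, a minimal sketch along the following lines may help. It assumes each sentence is a direct child of sentences and that each part of a triple carries at most one direct <text> child. The sample XML, the sentence id attribute and the helper names (element_children, text_of, triples_per_sentence) are made up for illustration and are not part of your file or of minidom.

```python
from xml.dom import minidom

# Hypothetical sample that mirrors the structure described in the question.
SAMPLE_XML = """<root><document><sentences>
<sentence id="1">
<openie>
<triple confidence="1.000">
<subject begin="1" end="2"><text>customer</text><lemma>customer</lemma></subject>
<relation begin="2" end="3"><text>enters</text><lemma>enter</lemma></relation>
<object begin="3" end="6"><text>their order number</text><lemma>they order number</lemma></object>
</triple>
</openie>
</sentence>
</sentences></document></root>"""


def element_children(node):
    """Return only the element children, skipping whitespace text nodes."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


def text_of(node):
    """Concatenate the data of the direct text-node children of node."""
    return ''.join(c.data for c in node.childNodes if c.nodeType == c.TEXT_NODE)


def triples_per_sentence(xml_doc):
    """Return one list per sentence; each holds a {tag name: text} dict per triple."""
    result = []
    for sentences_node in xml_doc.getElementsByTagName('sentences'):
        for sentence in element_children(sentences_node):
            if sentence.tagName != 'sentence':
                continue
            triples = []
            for triple in sentence.getElementsByTagName('triple'):
                roles = {}
                # Whatever tags the triple contains (subject, relation, object
                # or something else), keep those that have a direct <text> child.
                for part in element_children(triple):
                    for child in element_children(part):
                        if child.tagName == 'text':
                            roles[part.tagName] = text_of(child)
                            break
                triples.append(roles)
            result.append(triples)
    return result


xml_doc = minidom.parseString(SAMPLE_XML)
grouped = triples_per_sentence(xml_doc)
for index, triples in enumerate(grouped, start=1):
    print('Sentence', index, triples)
```

Because the inner loop keys each value by the tag name of its parent, an extra tag next to subject, relation and object simply shows up as another key, and a tag without a <text> child is skipped. For your real file you would swap minidom.parseString(SAMPLE_XML) for minidom.parse('./tmp/nlp_output.xml').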