I am trying to pull the numbers out of the tags of this XML file:
<start-date type="date">1980-12-12</start-date>
<end-date type="date">2018-05-04</end-date>
<data type="array">
<datum type="array">
<datum type="date">2018-05-04</datum>
<datum type="float">178.25</datum>
<datum type="float">184.25</datum>
<datum type="float">178.17</datum>
<datum type="float">183.83</datum>
<datum type="float">56201317.0</datum>
<datum type="float">0.0</datum>
<datum type="float">1.0</datum>
<datum type="float">178.25</datum>
<datum type="float">184.25</datum>
<datum type="float">178.17</datum>
<datum type="float">183.83</datum>
<datum type="float">56201317.0</datum>
</datum>
Using this script:
#Test Parser
from xml.dom import minidom
xmldoc = minidom.parse('AAPL.xml')
itemlist = xmldoc.getElementsByTagName('datum')
print(len(itemlist))
print(itemlist[0].attributes['type'].value)
for s in itemlist:
    print(s.attributes['type'].value)
But the output is the value of the type attribute, so it prints float, array, and date over and over. What I need are the numbers inside the datum tags, like this:
<datum type="float">178.25</datum>
I need the 178.25 value. How can I change my script to do this? This is my first parser project, so I am a bit lost here. Any help is appreciated.
The XML you posted is not valid as shown (it has no root element), so there are a few different ways to resolve your problem. All of them are very similar, though, and rely on the use of nodeValue. Below is one solution.
We assume you have the valid XML file (and I know you have one):
>>> from xml.dom import minidom
>>> xmldoc = minidom.parse('AAPL.xml')
From there, we will look for elements which have datum as a tag name:
>>> datums = xmldoc.getElementsByTagName('datum')
datums is a list of all the elements in the XML document that have the tag name datum. This actually includes the one you do not need: their parent node <datum type="array">.
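To see why the parent shows up, here is a minimal sketch that counts the type attribute of every matched element. The SAMPLE string and its <dataset> root are assumptions made only so the snippet is self-contained; they are not taken from your real file.

```python
from collections import Counter
from xml.dom import minidom

# Hypothetical fragment: a shortened row wrapped in an assumed <dataset> root.
SAMPLE = (
    "<dataset>"
    "<datum type='array'>"
    "<datum type='date'>2018-05-04</datum>"
    "<datum type='float'>178.25</datum>"
    "</datum>"
    "</dataset>"
)

doc = minidom.parseString(SAMPLE)

# getElementsByTagName searches all descendants, so the outer
# <datum type="array"> is returned together with its children.
datums = doc.getElementsByTagName('datum')
type_counts = Counter(d.getAttribute('type') for d in datums)

print(len(datums))           # 3: one container plus two leaf elements
print(type_counts['array'])  # 1
```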
We thus loop over these datums (excluding the parent one) to display their text.
Note that the text 178.25 below is a child node of the datum element.
<datum type="float">178.25</datum>
That is why we need to loop as follows:
>>> for datum in datums:
...     if datum.getAttribute('type') != 'array':  # exclude the parent datum
...         print(datum.childNodes[0].nodeValue)
As datum has a list of child nodes that consists of only one element (the text node), we need to write datum.childNodes[0] to access it. Once we are positioned on that text node, we can read its content through nodeValue, mentioned previously.
And here is the output:
>>> from xml.dom import minidom
>>> xmldoc = minidom.parse('AAPL.xml')
>>> datums = xmldoc.getElementsByTagName('datum')
>>> for datum in datums:
...     if datum.getAttribute('type') != 'array':
...         print(datum.childNodes[0].nodeValue)
...
2018-05-04
178.25
184.25
178.17
183.83
56201317.0
0.0
1.0
178.25
184.25
178.17
183.83
56201317.0
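If you want the values as Python objects rather than printed text, the same loop can be wrapped in a small function. This is only a sketch under some assumptions: datum_values and SAMPLE are hypothetical names, the <dataset> root is invented so the fragment parses, and skipping empty elements is one possible choice (datum.childNodes[0] would raise an IndexError on an element with no text).

```python
from xml.dom import minidom

# Hypothetical sample: part of the posted fragment inside an assumed <dataset> root.
SAMPLE = """<dataset>
  <data type="array">
    <datum type="array">
      <datum type="date">2018-05-04</datum>
      <datum type="float">178.25</datum>
      <datum type="float">56201317.0</datum>
    </datum>
  </data>
</dataset>"""


def datum_values(xml_text):
    """Return the text of every non-array <datum>, converting floats."""
    doc = minidom.parseString(xml_text)
    values = []
    for datum in doc.getElementsByTagName('datum'):
        kind = datum.getAttribute('type')
        if kind == 'array':       # skip the container <datum>
            continue
        if not datum.childNodes:  # empty element: no text node to read
            continue
        text = datum.childNodes[0].nodeValue
        # Dates stay as strings; only type="float" is converted.
        values.append(float(text) if kind == 'float' else text)
    return values


print(datum_values(SAMPLE))  # ['2018-05-04', 178.25, 56201317.0]
```

For your real file, minidom.parse('AAPL.xml') can replace minidom.parseString(xml_text) without changing the rest of the loop.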