Python3 Minidom Parse Data Inside Tag


I am trying to pull the numbers out of the tags in this XML file:

<start-date type="date">1980-12-12</start-date>
<end-date type="date">2018-05-04</end-date>
<data type="array">
  <datum type="array">
    <datum type="date">2018-05-04</datum>
    <datum type="float">178.25</datum>
    <datum type="float">184.25</datum>
    <datum type="float">178.17</datum>
    <datum type="float">183.83</datum>
    <datum type="float">56201317.0</datum>
    <datum type="float">0.0</datum>
    <datum type="float">1.0</datum>
    <datum type="float">178.25</datum>
    <datum type="float">184.25</datum>
    <datum type="float">178.17</datum>
    <datum type="float">183.83</datum>
    <datum type="float">56201317.0</datum>
  </datum>

Using this script:

#Test Parser

from xml.dom import minidom
xmldoc = minidom.parse('AAPL.xml')
itemlist = xmldoc.getElementsByTagName('datum')

print(len(itemlist))
print(itemlist[0].attributes['type'].value)
for s in itemlist:
    print(s.attributes['type'].value)

But the output only shows the value of the type attribute, so it prints float, array, and date over and over. I need the numbers inside the datum tag, like this:

<datum type="float">178.25</datum>

I need the 178.25 value. How can I change my script to do this? This is my first parser project, so I am a bit lost here. Any help is appreciated.


Solution

  • The XML you posted is not valid on its own because it has no root element, so the exact code depends on what the full file looks like. The possible approaches are all very similar, though, and rely on nodeValue. Below is one solution.

    Suppose we have your complete, valid XML file (I assume you have one):

    >>> from xml.dom import minidom
    >>> xmldoc = minidom.parse('AAPL.xml')
    

    From there, we will look for elements which have datum as a tag name:

    >>> datums = xmldoc.getElementsByTagName('datum')
    

    datums is a list of all the elements in the document whose tag name is datum. This includes one you do not need: the parent node <datum type="array">.

    We thus loop over these datums (excluding the parent one) to display their text.

    Note that the text 178.25 below is a child node of the datum element.

    <datum type="float">178.25</datum>
    

    That is why we need to loop as follows:

    >>> for datum in datums:
    ...     if datum.getAttribute('type') != 'array':  # exclude the parent datum
    ...         print(datum.childNodes[0].nodeValue)
    

    Each non-array datum has a list of child nodes containing a single item (the text node), so we write datum.childNodes[0] to access it. Once we have that text node, we can read its content through the nodeValue attribute mentioned previously.
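
To make the element/text-node distinction concrete, here is a minimal sketch that parses a single datum from an inline string instead of AAPL.xml. The SAMPLE string is a stand-in made up for illustration. It shows that the element itself has no nodeValue, while its only child is a text node that carries the number as a string.

```python
from xml.dom import minidom

# Hypothetical one-element document standing in for a line of AAPL.xml.
SAMPLE = '<datum type="float">178.25</datum>'

doc = minidom.parseString(SAMPLE)
datum = doc.documentElement          # the <datum> element
text_node = datum.childNodes[0]      # its single child: a text node

print(len(datum.childNodes))                       # number of child nodes
print(text_node.nodeType == text_node.TEXT_NODE)   # is the child a text node?
print(datum.nodeValue)                             # elements have no nodeValue
print(text_node.nodeValue)                         # the text content
```

Note that nodeValue is always a string here, so call float() on it if you need to do arithmetic with the result.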

    And here is the output:

    >>> from xml.dom import minidom
    >>> xmldoc = minidom.parse('AAPL.xml')
    >>> datums = xmldoc.getElementsByTagName('datum')
    >>> for datum in datums:
    ...     if datum.getAttribute('type') != 'array':
    ...         print(datum.childNodes[0].nodeValue)
    ... 
    2018-05-04
    178.25
    184.25
    178.17
    183.83
    56201317.0
    0.0
    1.0
    178.25
    184.25
    178.17
    183.83
    56201317.0
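
If the file really does lack a root element, as the posted snippet does, one possible workaround is to wrap the text in a root element before parsing. The sketch below assumes this: the <root> wrapper, the read_datums helper, and the shortened FRAGMENT (with a closing </data> added) are all made up for illustration and are not part of the original file. It also uses the type attribute to convert float values to Python floats, leaving dates as strings.

```python
from xml.dom import minidom

# Shortened, hypothetical copy of the posted snippet (closing </data> added).
FRAGMENT = """
<start-date type="date">1980-12-12</start-date>
<end-date type="date">2018-05-04</end-date>
<data type="array">
  <datum type="array">
    <datum type="date">2018-05-04</datum>
    <datum type="float">178.25</datum>
    <datum type="float">184.25</datum>
  </datum>
</data>
"""


def read_datums(fragment):
    """Return the values of all non-array datum elements in the fragment."""
    # Wrap the fragment in a made-up <root> so it becomes a valid document.
    doc = minidom.parseString("<root>" + fragment + "</root>")
    values = []
    for datum in doc.getElementsByTagName("datum"):
        kind = datum.getAttribute("type")
        # Skip the parent array and any empty element without a text node.
        if kind == "array" or datum.firstChild is None:
            continue
        text = datum.firstChild.nodeValue
        # Convert floats; keep everything else (e.g. dates) as strings.
        values.append(float(text) if kind == "float" else text)
    return values


values = read_datums(FRAGMENT)
print(values)
```

To use this with a file, read its contents with open('AAPL.xml').read() and pass the string to read_datums; if the file already has a root element, minidom.parse works directly and the wrapper is unnecessary.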