Python minidom extract text from XML


Python beginner here. I am trying to parse the structure of an XML file, using minidom. The XML structure is like this:

...
    <Node Precode="1">
        <Text Id="9">sometext 1</Text>
    </Node>
...

I am trying to add all element nodes to a list, using a recursive function (not of my own design; I found it on Stack Overflow and adapted it to my needs). This is what I have so far:

from xml.dom import minidom
list_to_write=[]
def parse_node(root):
    if root.childNodes:
        for node in root.childNodes:
            if node.nodeType == node.ELEMENT_NODE:
                new_node = [node.tagName,node.parentNode.tagName,node.getAttribute('Precode'),node.attributes.items()]

                list_to_write.append(new_node)

                parse_node(node)
    return list_to_write

How can I extract the "sometext" text and add it as an element in the list_to_write list?


Solution

  • I assume you have a nodes.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
        <Node >
            <Text Id="9">sometext 1</Text>
        </Node>
        <Node >
            <Text Id="9">sometext 2</Text>
        </Node>
        <Node >
            <Text Id="9">sometext 3</Text>
        </Node>
        <Node >
            <Text Id="9">sometext 4</Text>
        </Node>
        <Node >
            <Text Id="9">sometext 5</Text>
        </Node>
        <Node>
            <Text Id="9">sometext 6</Text>
        </Node>
        <Node >
            <Text Id="9">sometext 7</Text>
        </Node>
    </root>
    

    You can then use the code below to get the texts:

    from xml.dom import minidom
    
    list_to_write = []
    
    def parse_node():
        doc = minidom.parse("nodes.xml")
        root = doc.documentElement
    
        # Each <Node> holds one <Text> element; its first child is the text node.
        for node in root.getElementsByTagName("Node"):
            text_element = node.getElementsByTagName("Text")[0]
            list_to_write.append(text_element.childNodes[0].nodeValue)
    
    parse_node()
    
    print(list_to_write)
    

    The result under Python 2 is shown below (Python 3 prints the same strings without the u prefix):

    [u'sometext 1', u'sometext 2', u'sometext 3', u'sometext 4', u'sometext 5', u'sometext 6', u'sometext 7']
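
If you would rather keep the recursive function from the question, here is a minimal sketch of one way to add the text as an extra element of each row. The helper `direct_text`, the `rows` parameter, and the inline sample XML (including its `Precode` and `Id` values) are my own assumptions for illustration, not part of the original code. It collects only the text nodes that are direct children of each element, so a `Node` that contains nothing but whitespace and child elements gets an empty string.

```python
from xml.dom import minidom

# Sample document; the attribute values are made up for illustration.
XML = """<root>
    <Node Precode="1">
        <Text Id="9">sometext 1</Text>
    </Node>
    <Node Precode="2">
        <Text Id="10">sometext 2</Text>
    </Node>
</root>"""


def direct_text(element):
    """Join the text nodes that are direct children of element, stripped."""
    parts = [child.data for child in element.childNodes
             if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)]
    return "".join(parts).strip()


def parse_node(root, rows=None):
    """Recursively collect [tag, parent tag, Precode, attributes, text] rows."""
    if rows is None:
        rows = []
    for node in root.childNodes:
        if node.nodeType == node.ELEMENT_NODE:
            rows.append([
                node.tagName,
                node.parentNode.tagName,
                node.getAttribute("Precode"),
                list(node.attributes.items()),
                direct_text(node),
            ])
            parse_node(node, rows)
    return rows


doc = minidom.parseString(XML)
# Start from the document element so every visited node has a parent with a tagName.
rows = parse_node(doc.documentElement)
for row in rows:
    print(row)
```

Passing the list as a parameter instead of using a module-level `list_to_write` means repeated calls do not accumulate rows from earlier runs; if you prefer the global list, appending `direct_text(node)` to `new_node` in the original function should work the same way.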