python xml-parsing minidom

Python minidom XML parser - ignore child tags


I have an XML file that looks like this:

<tag1>
    <tag2>
        I am too good <italic>to be true</italic>
    </tag2>
</tag1>

When I want to extract the data within the "tag2" tags, assuming the XML file has been read into the "XML_data" variable:

XML_data.getElementsByTagName('tag1')[0].getElementsByTagName('tag2')[0].childNodes[0].data
evaluates to "I am too good"
and 
XML_data.getElementsByTagName('tag1')[0].getElementsByTagName('tag2')[0].getElementsByTagName('italic')[0].childNodes[0].data
evaluates to "to be true"

What I want is to extract the whole chunk within tag2, ignoring the italic tags, i.e., I want my output to be

"I am too good <italic>to be true</italic>"

How do I do this? Please help.


Solution

  • In the end I used ElementTree instead of minidom:

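    import xml.etree.ElementTree as ET
    import re
    
    def extractTextFromElement(elementName, stringofxml):
        tree = ET.fromstring(stringofxml)
        # iter() replaces getiterator(), which was removed in Python 3.9
        for child in tree.iter(elementName):
            # encoding='unicode' makes tostring() return str, not bytes
            xml_text = ET.tostring(child, encoding='unicode')
            # Strip every tag, leaving only the text content
            return re.sub(r'<.*?>', '', xml_text)
    
    
    usage: extractTextFromElement('tag2', XML_data)
    (XML_data must be the XML as a string here, not a parsed minidom document)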
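
The question can be read two ways: either the italic tags should be dropped from the output, or they should be kept verbatim as in the quoted desired output. Below is a minimal sketch covering both readings, not a definitive implementation. The function names text_without_tags and inner_xml_with_tags and the XML_DATA constant are made up for illustration. The first function uses ElementTree's itertext() instead of a regex; the second stays with minidom and serializes each child node with toxml().

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Sample document from the question (XML_DATA is a hypothetical name)
XML_DATA = """<tag1>
    <tag2>
        I am too good <italic>to be true</italic>
    </tag2>
</tag1>"""


def text_without_tags(element_name, xml_string):
    """Return the text inside the first matching element, child tags dropped."""
    root = ET.fromstring(xml_string)
    element = next(root.iter(element_name), None)
    if element is None:
        return None
    # itertext() yields the element's own text and all descendant text
    # in document order, so no regex is needed to strip tags
    return "".join(element.itertext()).strip()


def inner_xml_with_tags(element_name, xml_string):
    """Return the content of the first matching element, child tags kept."""
    dom = minidom.parseString(xml_string)
    nodes = dom.getElementsByTagName(element_name)
    if not nodes:
        return None
    # toxml() serializes each child node, so <italic>...</italic> survives
    return "".join(child.toxml() for child in nodes[0].childNodes).strip()


print(text_without_tags("tag2", XML_DATA))
print(inner_xml_with_tags("tag2", XML_DATA))
```

Both functions return None when no element with the given name exists, and both strip the leading and trailing whitespace that comes from the indentation in the file.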