
How to get inner content as string using minidom from xml.dom?


I have some text tags in my XML file (a PDF converted to XML using pdftohtml from poppler-utils) that look like this:

<text top="525" left="170" width="603" height="16" font="1">..part of old large book</text>
<text top="546" left="128" width="645" height="16" font="1">with many many pages and some <i>italics text among 'plain' text</i> and more and more text</text>
<text top="566" left="128" width="642" height="16" font="1">etc...</text>

and I can get the text enclosed in a text tag with this sample code:

from xml.dom import minidom

xmldoc = minidom.parse('../test/text.xml')
itemlist = xmldoc.getElementsByTagName('text')

node_index = 0  # index of the <text> element to inspect
some_tag = itemlist[node_index]
output_text = some_tag.firstChild.nodeValue
# if all of the text is inside <i>, I can get it with
output_text = some_tag.firstChild.firstChild.nodeValue

# but not if <i></i> wraps only part of the string

but I cannot get "nodeValue" if the element contains another tag (<i>, <b>, ...) inside it, and I cannot get the node object either.
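
To make the problem concrete, here is a minimal sketch that lists the child nodes of a mixed-content element. It assumes the second sample line above is parsed from a string instead of the file; the `sample` and `first_value` names are only for illustration.

```python
from xml.dom import minidom

# Illustrative sample based on the second <text> element above
sample = ('<text top="546" left="128" width="645" height="16" font="1">'
          "with many many pages and some <i>italics text among 'plain' text</i>"
          " and more and more text</text>")

text_tag = minidom.parseString(sample).documentElement

# Mixed content is split into three children: a text node, the <i> element,
# and another text node. Element nodes have nodeValue set to None.
for child in text_tag.childNodes:
    print(child.nodeType, child.nodeName, repr(child.nodeValue))

# firstChild is only the leading text node, not the whole content
first_value = text_tag.firstChild.nodeValue
```

So `firstChild.nodeValue` returns only the text before `<i>`, and the rest has to be collected from the remaining children.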

What is the best way to get all of the text as a plain string, like JavaScript's innerHTML property, or to recurse into child tags even if they wrap only some of the words rather than the entire nodeValue?

thanks


Solution

  • Question: How to get inner content as string using minidom

    One option is a recursive solution, for instance:

    from xml.dom import minidom


    def getText(nodelist):
        # Walk all nodes and collect the data of every TEXT_NODE
        rc = []
        for node in nodelist:
            if node.nodeType == node.TEXT_NODE:
                rc.append(node.data)
            else:
                # Recurse into child elements such as <i> or <b>
                rc.append(getText(node.childNodes))
        return ''.join(rc)


    xmldoc = minidom.parse('../test/text.xml')
    nodelist = xmldoc.getElementsByTagName('text')

    # Iterate over the <text ...>...</text> node list
    for node in nodelist:
        print(getText(node.childNodes))
    

    Output:

    ..part of old large book
    with many many pages and some italics text among 'plain' text and more and more text
    etc...
    

    Tested with Python: 3.4.2
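
If the markup itself is wanted (closer to what JavaScript's innerHTML returns than plain text), one possible sketch is to serialize each child node with `toxml()` and join the results. The helper name `inner_xml` and the shortened inline sample are assumptions for illustration and are not part of the answer above.

```python
from xml.dom import minidom


def inner_xml(node):
    # Hypothetical helper: serialize every child node and join the pieces,
    # so nested tags such as <i> are kept in the result
    return ''.join(child.toxml() for child in node.childNodes)


# Shortened illustrative sample, parsed from a string instead of a file
sample = '<text font="1">some <i>italics</i> and more</text>'
text_tag = minidom.parseString(sample).documentElement

markup = inner_xml(text_tag)
print(markup)
```

Note that `toxml()` escapes special characters in text nodes (for example `&` becomes `&amp;`), so the result is markup and not the raw text that getText returns.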