Search code examples
pythonxmlparsingcomments

How do i parse a xml comment properly in python


i have been using Python recently and i want to extract information from a given xml file. The problem is that the information is really badly stored, in a format like this

<Content>
   <tags>
   ....
   </tags>
<![CDATA["string1"; "string2"; ....
]]>
</Content>

I can not post the entire data here, since it is about 20.000 lines. I just want to recieve the list containing ["string1", "string2", ...] and this is the code i have been using so far:

import xml.etree.ElementTree as ET

tree = ET.parse(xmlfile)
for node in tree.iter('Content'):
    print (node.text)

However my output is none. How can i recieve the comment data? (again, I am using Python)


Solution

  • You'll want to create a SAX based parser instead of a DOM based parser. Especially with a document as large as yours.

    A sax based parser requires you to write your own control logic in how data is stored. It's more complicated than simply loading it into a DOM, but much faster as it loads line by line and not the entire document at once. Which gives it the advantage that it can deal with squirrely cases like yours with comments.

    When you build your handler, you'll probably want to use the LexicalHandler in your parser to pull out those comments.

    I'd give you a working example on how to build one, but it's been a long time since I've done it myself. There's plenty of guides on how to build a sax based parser online, and will defer that discussion to another thread.