XML parsing with expat in Python: handling character data


I am attempting to parse an XML file using Python's expat parser. I have the following line in my XML file:

<Action>&lt;fail/&gt;</Action>

expat identifies the start and end tags, but it converts &lt; to the less-than character and &gt; to the greater-than character, and so reports the content like this:

outcome:

START 'Action'
DATA '<'
DATA 'fail/'
DATA '>'
END 'Action'

instead of the desired:

START 'Action'
DATA '&lt;fail/&gt;'
END 'Action'

I would like to get the desired outcome. How do I prevent expat from converting the entities?


Solution

  • expat does not mess up: &lt; is simply the XML encoding for the character <. On the contrary, if expat returned the literal &lt;, that would be a bug with respect to the XML spec. That said, you can of course get the escaped version back by using xml.sax.saxutils.escape:

    >>> from xml.sax.saxutils import escape
    >>> escape("<fail/>")
    '&lt;fail/&gt;'
    

    The expat parser is also free to report character data in whatever chunks it sees fit, so you have to concatenate them yourself.
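
Putting the two points together, here is a minimal sketch of one way to collect the chunks and re-escape them. The function name `parse_events`, the `buffer` list, and the event tuples are illustrative choices, not part of expat's API. Character data is buffered until the next start or end tag, then joined and passed through `escape`:

```python
import xml.parsers.expat
from xml.sax.saxutils import escape


def parse_events(xml_text):
    """Return a list of (kind, value) events with character data merged."""
    events = []
    buffer = []  # pending character-data chunks (hypothetical helper state)

    def flush():
        # Join the buffered chunks and re-escape &, < and >.
        if buffer:
            events.append(("DATA", escape("".join(buffer))))
            buffer.clear()

    def start(name, attrs):
        flush()
        events.append(("START", name))

    def end(name):
        flush()
        events.append(("END", name))

    def char_data(data):
        # expat may call this several times for one run of text.
        buffer.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = char_data
    parser.Parse(xml_text, True)
    return events


events = parse_events("<Action>&lt;fail/&gt;</Action>")
for kind, value in events:
    print(kind, repr(value))
```

Note that `escape` produces an escaped form of the text, which is not necessarily byte-for-byte what was in the file (for example, a literal `>` in the source would also come back as `&gt;`). Setting `parser.buffer_text = True` may also reduce the number of `CharacterDataHandler` calls, but joining the chunks yourself does not depend on that behaviour.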