I am attempting to parse an XML file using python expat. I have the following line in my XML file:
<Action>&lt;fail/&gt;</Action>
expat identifies the start and end tags, but it converts &lt; to the less-than character (and likewise &gt; to the greater-than character), so the parse comes out like this:
START 'Action'
DATA '<'
DATA 'fail/'
DATA '>'
END 'Action'
instead of the desired:
START 'Action'
DATA '&lt;fail/&gt;'
END 'Action'
How do I get the desired outcome and prevent expat from messing up?
expat does not mess up: &lt; is simply the XML encoding for the character <. Quite the contrary: if expat returned the literal &lt;, this would be a bug with respect to the XML spec. That being said, you can of course get the escaped version back by using xml.sax.saxutils.escape:
>>> from xml.sax.saxutils import escape
>>> escape("<fail/>")
'&lt;fail/&gt;'
The expat parser is also free to report character data in whatever chunks it sees fit, so you have to concatenate them yourself.
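As a rough sketch of that concatenation (the function name parse_events and the (kind, value) event tuples are made up for illustration, not part of expat), you could collect the chunks in a list and join them whenever a start or end tag arrives:

```python
import xml.parsers.expat
from xml.sax.saxutils import escape


def parse_events(xml_text):
    """Return (kind, value) events with adjacent character data merged."""
    events = []
    chunks = []  # character data seen since the last tag

    def flush():
        # Join whatever chunks expat delivered into one DATA event.
        if chunks:
            events.append(("DATA", "".join(chunks)))
            chunks.clear()

    def start(name, attrs):
        flush()
        events.append(("START", name))

    def end(name):
        flush()
        events.append(("END", name))

    def data(text):
        # expat may call this several times for one run of text.
        chunks.append(text)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = data
    parser.Parse(xml_text, True)
    return events


events = parse_events("<Action>&lt;fail/&gt;</Action>")
for kind, value in events:
    # Re-escape the text only if you need the entity form back.
    print(kind, repr(escape(value) if kind == "DATA" else value))
```

The parser object also has a buffer_text attribute; setting it to True asks expat to buffer text and cut down on repeated CharacterDataHandler calls where it can, but joining the chunks yourself as above does not depend on that.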