python xml sax

Python SAX parser fails to handle &#18; character


I am trying to parse an XML file with a subclass of xml.sax.handler.ContentHandler. The parser fails at the following line:

<desc>&#18;some_text&#15;</desc>

and I get the following error:

xml.sax._exceptions.SAXParseException: test.xml:687338:17: reference to invalid character number

As I read the spec (http://www.w3.org/TR/xml/#sec-references), the character references &#18; and &#15; are valid. So is there a bug in the parser, or am I doing something wrong?


Solution

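  • Although you can write these characters as numeric references, XML 1.0 does not allow them: a character reference must still resolve to a character matching the Char production, and &#18; (0x12) and &#15; (0x0F) are control characters outside it. See http://www.w3.org/TR/xml/#NT-Char for the allowed ranges. So the parser is not buggy; the document is simply not well-formed XML 1.0. The XML 1.1 spec adds most control characters back as "restricted" characters, which are allowed only when written as character references.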

    If the text legitimately needs to carry these characters, it's wise to encode it first, e.g. with base64, and decode it on the receiving side. The receiver then gets well-formed XML (under XML 1.1 this is not always required, but encoding keeps the document compatible with 1.0 parsers).

    I once had to deal with externally supplied invalid XML myself, where I had no control over the sender. It's pretty messy. In my case I could rely on certain patterns and use regular expressions to "patch away" the improprieties before parsing, but this is a hack: a workaround done out of desperation rather than a proper fix.

    (In my case I had to handle things that would have tripped up even an XML 1.1 parser. The sender was just plain broken: a bunch of Perl code using faulty regexps and some literal <foo> type strings to generate pretend-XML. So I never looked any further.)
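
For what it's worth, here is a minimal sketch of both routes in Python. It is not what I actually ran back then, just an illustration under some assumptions: the input fits in memory as a string, and dropping the offending references is acceptable. The names strip_invalid_char_refs, encode_field, decode_field, TextCollector and parse_text are made up for this example; only re, base64 and xml.sax come from the standard library. Note that the regex also touches anything that merely looks like a character reference inside CDATA sections or comments, which is part of why this remains a hack.

```python
import base64
import re
import xml.sax
from xml.sax.handler import ContentHandler

# Numeric character references, hexadecimal (&#x12;) or decimal (&#18;).
_CHAR_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def _is_xml10_char(cp):
    """True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_invalid_char_refs(text, replacement=""):
    """Receiver-side hack: replace references XML 1.0 forbids, keep the rest."""
    def fix(match):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return match.group(0) if _is_xml10_char(cp) else replacement
    return _CHAR_REF.sub(fix, text)


def encode_field(raw_text):
    """Sender-side fix: base64-encode text so control characters never reach the XML."""
    return base64.b64encode(raw_text.encode("utf-8")).decode("ascii")


def decode_field(encoded):
    """Inverse of encode_field, applied by the receiver after parsing."""
    return base64.b64decode(encoded).decode("utf-8")


class TextCollector(ContentHandler):
    """Hypothetical handler that just gathers all character data."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_text(xml_text):
    """Parse xml_text with SAX and return the concatenated character data."""
    handler = TextCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return "".join(handler.chunks)


if __name__ == "__main__":
    broken = "<desc>&#18;some_text&#15;</desc>"
    # parse_text(broken) raises xml.sax.SAXParseException
    print(parse_text(strip_invalid_char_refs(broken)))
```

If you do control the sender, wrapping the field with encode_field before writing it and calling decode_field after parsing is the cleaner option, since nothing is lost; the regex route silently discards the control characters.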