Tags: xml, encoding, utf-8, expat-parser

Why does expat reject en dash character as invalid?


In my XML input file I have the following line:

<change beforeWhat="Literacy rate in L2: 50%–75%. Informally used" />

The character between 50% and 75% is not a hyphen but an en dash.

When I parse this XML file using expat in Python:

from xml.dom import minidom
postFixesDoc = minidom.parse('postFixes.xml')

I get the following error:

ExpatError: not well-formed (invalid token): line 35, column 99

where 35 is the line I quoted above from the XML input file, and 99 is the column of the % right before the en dash.

If I replace the en dash with &#x2013;, then the error goes away and everything works fine. So I have a workaround. But I don't understand why this is happening.

What I've read about the problem (e.g. "Python's minidom, xml, and illegal unicode characters") tells me that some characters that are legal in UTF-8 aren't legal in XML, and points me to section 2.2 of the XML spec on legal character ranges. But the definition of Char there includes the range #x20-#xD7FF, and #x2013 obviously falls within that range. So what's the problem?
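The range argument can be checked mechanically. The following is a minimal sketch that transcribes the Char production from section 2.2 of the XML 1.0 spec; the helper name is_xml_char is made up for illustration and is not part of any library.

```python
def is_xml_char(code_point):
    """Return True if code_point matches the Char production of XML 1.0, section 2.2."""
    return (
        code_point in (0x9, 0xA, 0xD)            # tab, line feed, carriage return
        or 0x20 <= code_point <= 0xD7FF          # the range quoted in the question
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


EN_DASH = "\u2013"
print(is_xml_char(ord(EN_DASH)))  # True: the en dash is a legal XML character
```

So the character itself is not what the parser is objecting to.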

FWIW, the XML input file begins with a UTF-8 declaration,

<?xml version="1.0" encoding="utf8"?>

and I used a hex editor to verify that the en dash is represented by the character sequence E2 80 93, which is the correct UTF-8 encoding for en dash. So why won't expat accept it? Is this a bug in expat?
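The hex-editor check can also be reproduced in Python. This is only a sketch: hex_dump is a hypothetical helper written for this example.

```python
EN_DASH = "\u2013"


def hex_dump(data):
    """Format bytes as space-separated uppercase hex pairs (illustrative helper)."""
    return " ".join("%02X" % byte for byte in data)


encoded = EN_DASH.encode("utf-8")
print(hex_dump(encoded))                    # E2 80 93
print(encoded.decode("utf-8") == EN_DASH)   # True: the bytes round-trip cleanly
```

This matches the bytes seen in the file, so the content is valid UTF-8 and the problem must lie elsewhere.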


Solution

  • Aha...

    This Python doc footnote, though it applies to a different situation, alerted me to the fact that my XML encoding declaration was wrong:

    The encoding string included in XML output should conform to the appropriate standards. For example, “UTF-8” is valid, but “UTF8” is not.

    For some reason I was under the impression that utf8 was acceptable too. But when I changed the declaration to

    <?xml version="1.0" encoding="utf-8"?>
    

    the error went away!
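To compare the two declarations side by side, here is a minimal sketch that builds the document in memory instead of reading postFixes.xml. The helpers make_doc and try_parse are made up for this example. The outcome for "utf8" is deliberately not asserted, since it may depend on the Python and expat versions in use; in the setup described above it failed with ExpatError.

```python
import codecs
from xml.dom import minidom

EN_DASH_UTF8 = b"\xe2\x80\x93"  # UTF-8 bytes for U+2013


def make_doc(encoding_name):
    """Build a small XML document (as bytes) with the given encoding declaration."""
    return (
        b'<?xml version="1.0" encoding="' + encoding_name.encode("ascii") + b'"?>\n'
        b'<change beforeWhat="Literacy rate in L2: 50%' + EN_DASH_UTF8
        + b'75%. Informally used" />'
    )


def try_parse(encoding_name):
    """Return (True, attribute value) on success or (False, error text) on failure."""
    try:
        doc = minidom.parseString(make_doc(encoding_name))
    except Exception as err:  # broad on purpose: the exact exception type may vary
        return (False, "%s: %s" % (type(err).__name__, err))
    return (True, doc.documentElement.getAttribute("beforeWhat"))


print(try_parse("utf-8"))  # parses; the attribute contains the en dash
print(try_parse("utf8"))   # outcome depends on the parser; failed in the case above

# Python's own codec registry accepts "utf8" as an alias, which may be where
# the impression that it is acceptable came from.
print(codecs.lookup("utf8").name)  # utf-8
```

Python's codecs are lenient about the spelling, but the XML parser evidently was not, so it is safest to write the declaration as "utf-8".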