XmlDocument mis-reads UTF-8 'e-acute' character


I'm reading an XML document that contains the é (e acute) character. The document has been saved as UTF-8 and I have confirmed that the character is UTF-8 with a binary file reader (it is c3+a9). However, after processing, the character becomes a three-byte jumble (c3+83+c2).

My best guess is that .NET has either tried to convert the character to UTF-16, or has split it into one single-byte character and one double-byte UTF-8 character.

I'm loading the document like this:

XmlDocument document = new XmlDocument();
document.Load("z:\\source.xml");

How should I be loading this? Should I be reading this through a UTF-8-encoded stream?


[Edit]

I forgot to mention the document I'm loading is declaring itself as UTF-8.

<?xml version="1.0" encoding="utf-8"?>

Solution

  • é is encoded in UTF-8 as C3 A9. Interpreted in the Windows-1252 codepage (aka the ANSI codepage, or Encoding.Default in .NET), those two bytes are the two characters Ã©. Re-encoding those two characters in UTF-8 gives C3 83 C2 A9, whose first three bytes match your "three-byte jumble". It appears that some code somewhere is performing a Windows-1252 bytes -> System.String chars -> UTF-8 bytes conversion.

    I've never seen .NET use the wrong encoding when it's explicitly specified in the XML declaration (XmlDocument.Load should "just work"), so I would suspect that there is a bug in your code.

    How are you determining that it's loading incorrectly? Once it's loaded in .NET, you would see strings, not bytes, so it seems odd to me that you're reporting an incorrect byte sequence, not an incorrect sequence of characters.
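
To make the double-encoding concrete, here is a minimal sketch that reproduces the byte sequence and then checks what XmlDocument actually holds after loading. The class and method names (EncodingCheck, DoubleEncode, LoadRootText) and the `<root>` element are made up for illustration. It also assumes ISO-8859-1 as a stand-in for Windows-1252: the two agree on the bytes C3 and A9, and ISO-8859-1 is available on .NET Core / .NET 5+ without registering an extra encoding provider.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class EncodingCheck
{
    // Decode UTF-8 bytes with the wrong single-byte encoding, then re-encode
    // the resulting string as UTF-8. ISO-8859-1 stands in for Windows-1252
    // here (assumption: both map C3 to 'Ã' and A9 to '©').
    public static byte[] DoubleEncode(byte[] utf8Bytes)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string misread = latin1.GetString(utf8Bytes);
        return Encoding.UTF8.GetBytes(misread);
    }

    // Load XML from raw bytes and return the root element's text.
    // XmlDocument.Load(Stream) picks the encoding from the XML declaration.
    public static string LoadRootText(byte[] xmlBytes)
    {
        var document = new XmlDocument();
        using (var stream = new MemoryStream(xmlBytes))
        {
            document.Load(stream);
        }
        return document.DocumentElement.InnerText;
    }

    public static void Main()
    {
        // The UTF-8 encoding of 'é'.
        byte[] eAcute = { 0xC3, 0xA9 };
        Console.WriteLine(BitConverter.ToString(DoubleEncode(eAcute)));

        // A hypothetical document that declares itself as UTF-8.
        byte[] xml = Encoding.UTF8.GetBytes(
            "<?xml version=\"1.0\" encoding=\"utf-8\"?><root>\u00E9</root>");

        // Inspect characters (code points), not bytes, after loading.
        foreach (char c in LoadRootText(xml))
        {
            Console.WriteLine("U+{0:X4}", (int)c);
        }
    }
}
```

If the loaded text comes back as the single character U+00E9, the load itself is fine, and the corruption is more likely happening wherever the string is later written out or displayed.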