I already posted a question about this XML/UTF-16 problem ("Emacs displays chinese character if I open xml file"), but now I would like to understand why this kind of problem arises. Maybe, with a deeper understanding, I can cope with such problems better.
Concretely, I received an XML file that was encoded as UTF-16. I opened the file on my Windows XP PC with Emacs (and also Notepad and Firefox), and figure (A) is what was displayed (Firefox says: not well-formed). Obviously, the file was exported with the UTF-16 encoding. (B) shows the hexadecimal version. (C) shows the XML file after converting it to UTF-8 in Emacs (`revert-buffer-with-coding-system`). I also converted the UTF-16 XML file to UTF-8 with Perl; the result is shown in (D).
My questions:
Thanks for your patience.
There are various things you do not seem to know:
This will be just a link to “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” by Joel Spolsky.
TL;DR: Encodings are bijective partial functions that map byte sequences to character sequences and back again. Unicode is a large list of characters which each have a number (the codepoint). Various encodings are used to map these codepoints to bytes:

- UTF-8 uses one to four bytes per codepoint and is backwards-compatible with ASCII.
- UTF-16 uses one or two 16-bit units per codepoint, so the byte order of those units matters. The byte order mark (BOM) `0xFEFF` or `0xFFFE` sorts this out, and one of them precedes every UTF-16 document.

Some characters (“control characters”) have no printable interpretation. In your hexdump, unprintable bytes are represented with a `.`. Emacs and Vim follow the traditional route of prefixing control codes with `^`, which means that it together with the next character represents a control code. `^@` means the NUL character, while `^H` represents the backspace, and `^D` represents the end of a transmission. You get the ASCII value of the control character by subtracting `0x40` from the ASCII character in the visual representation. `\377` is the octal representation for `0xFF`.
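To see these mappings as actual bytes, here is a small Perl sketch (mine, not from your post; the sample string is arbitrary) using the core `Encode` module. The NUL bytes and the byte-order difference are exactly what shows up in your hexdump:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Encode the same five characters with several encodings and dump the bytes.
my $text = '<?xml';

for my $enc ('UTF-8', 'UTF-16LE', 'UTF-16BE', 'UTF-16') {
    my $bytes = encode($enc, $text);
    printf "%-8s %s\n", $enc, join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# UTF-8    3c 3f 78 6d 6c
# UTF-16LE 3c 00 3f 00 78 00 6d 00 6c 00
# UTF-16BE 00 3c 00 3f 00 78 00 6d 00 6c
# UTF-16   fe ff 00 3c 00 3f 00 78 00 6d 00 6c   (BOM, then big-endian by default)
```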
The default encoding for XML is UTF-8, because it is backwards-compatible with ASCII. Using any other encoding is unnecessary pain, as is evidenced by this question. Anyway, UTF-16 can be used, if properly declared (which your input tries), but then gets messed up.
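If you really must produce UTF-16 XML from Perl, a minimal sketch of a sane way to do it (file name and document content are invented here) is to let an explicit encoding layer do all the byte work, with `:raw` keeping newline translation away from the encoded bytes:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Write a small, well-formed UTF-16LE XML document.
# ':raw' removes the CRLF translation layer; ':encoding(UTF-16LE)' then turns
# every character we print into its two-byte little-endian form.
open my $out, '>:raw:encoding(UTF-16LE)', 'example.xml' or die "open: $!";

print {$out} "\x{FEFF}";                                  # BOM, written as ff fe
print {$out} qq{<?xml version="1.0" encoding="UTF-16"?>\n};
print {$out} qq{<greeting>hello</greeting>\n};

close $out or die "close: $!";
```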
Your file has the following parts:

- The BOM `0xFFFE`, which means the first byte is the low byte in the input. ASCII characters are then each followed by a NUL byte.
- At the end, the byte sequence `0d00 0d0a`. `0d00` is `CR`, the carriage return. The second part was meant to be `0a00`, the line feed. Together, they would form a Windows line ending. The `0d0a` by itself would be an ASCII CRLF. But this is wrong, because UTF-16 is a two-byte encoding.

What happened: Someone printed out the XML preamble, which was encoded in UTF-16LE. The `\n` at the end was automatically translated to `\r\n`. So `0d00 0a00` became `0d00 0d0a 00`.
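That corruption is easy to reproduce. The following sketch (file name invented; this is not your actual export script) encodes the preamble to UTF-16LE bytes first and then prints them through a handle that still translates newlines, which is effectively what happened:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Encode first, then print through a handle that still translates newlines.
my $preamble = qq{<?xml version="1.0" encoding="UTF-16"?>\r\n};
my $bytes    = encode('UTF-16LE', $preamble);   # ends in 0d 00 0a 00

open my $broken, '>', 'broken.xml' or die "open: $!";
binmode $broken, ':crlf';     # make the Windows default behaviour explicit
print {$broken} $bytes;       # the lone 0a byte becomes 0d 0a
close $broken;

# The file now ends in 0d 00 0d 0a 00 instead of 0d 00 0a 00.
```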
This can happen in Perl when you don't decode your input but encode your output. On Windows, Perl does automatic newline translation, which can be switched off via `binmode $fh`.
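A short sketch of that switch, with invented names; once the translation layer is gone, the pre-encoded bytes reach the disk unchanged:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

my $bytes = encode('UTF-16LE', qq{<?xml version="1.0" encoding="UTF-16"?>\r\n});

open my $fh, '>', 'fixed.xml' or die "open: $!";
binmode $fh;                  # same as binmode $fh, ':raw': no CRLF layer
print {$fh} $bytes;           # the line ending stays 0d 00 0a 00
close $fh;
```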
If your script could fix this error, then it made the same mistake in reverse (translating `\r\n` to `\n`, and then decoding it).
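Such an accidental repair would look roughly like this (again a sketch with invented file names, not your script): the `:crlf` read layer collapses the stray `0d 0a` pair back to `0a`, which happens to undo the damage before the bytes are decoded:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

open my $in, '<', 'broken.xml' or die "open: $!";
binmode $in, ':crlf';                  # 0d 0a on disk arrives as 0a in memory
my $bytes = do { local $/; <$in> };    # slurp the whole file as bytes
close $in;

my $text = decode('UTF-16LE', $bytes); # now the UTF-16 units line up again
print $text =~ /<\?xml/ ? "preamble decodes cleanly\n" : "still broken\n";
```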
Such errors can be avoided by decoding all input directly, and encoding it again before you print. Internally, always operate on codepoints, not bytes. In Perl, encodings can be added to a filehandle with `binmode`, which performs the de- and encoding transparently.
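Putting this together, here is a minimal conversion sketch (file names and the declaration rewrite are my assumptions, not your original script): decode UTF-16LE on the way in, work with characters in between, and encode UTF-8 on the way out:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read characters (decoded from UTF-16LE), write characters (encoded to UTF-8).
open my $in,  '<:raw:encoding(UTF-16LE)', 'input-utf16.xml' or die "open: $!";
open my $out, '>:encoding(UTF-8)',        'output-utf8.xml' or die "open: $!";

while (my $line = <$in>) {
    $line =~ s/\A\x{FEFF}//;                          # drop the BOM character
    $line =~ s/\r\n\z/\n/;                            # normalize the line ending
    $line =~ s/encoding="UTF-16"/encoding="UTF-8"/;   # keep the declaration honest
    print {$out} $line;
}

close $in;
close $out or die "close: $!";
```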