Given the input:
<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<sms body=". what" />
where the character after the "." in the body attribute of the sms tag is U+00A0 (NO-BREAK SPACE),
I get the error:
XMLEncodingException: Invalid UTF-8 character encoding (line 2) (column 13)
IIUC, the UTF-8 representation of that character is 0xC2 0xA0
per Wikipedia. Sure enough, bytes 72 and 73 of the input are 194 and 160 respectively.
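A quick way to double-check those two byte values, sketched here in Python rather than Pharo (the variable names are only illustrative):

```python
# Encode U+00A0 (NO-BREAK SPACE) as UTF-8 and inspect the resulting bytes.
nbsp = "\u00a0"
encoded = nbsp.encode("utf-8")

print(encoded.hex(" "))  # hexadecimal form of the two bytes
print(list(encoded))     # the same bytes as decimal values
```

This prints `c2 a0` and `[194, 160]`, which matches the two bytes found in the input file.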
This seems like a bug in XMLParser, or am I missing something?
Thanks to Monty for coming to the rescue on the Pharo User's list:
You're double decoding. Use onFileNamed:/parseFileNamed: instead (and the DOM printToFileNamed: family of messages when writing) and let XMLParser take care of this for you, or disable XMLParser decoding before parsing with #decodesCharacters:.
Longer explanation:
The class-side #on:/#parse: messages take either a string or a stream (read the definitions). You gave them a FileReference, but because the argument is tested with isString and sent #readStream otherwise, it didn't blow up then.
File refs sent #readStream return file streams that do automatic decoding. But XMLParser automatically attempts its own decoding too, if:
The input starts with a BOM, or the encoding can be inferred from null bytes before or after the first non-null byte.
There is an encoding declaration with a non-UTF-8 encoding.
There is a UTF-8 encoding declaration but the stream is not a normal ReadStream (your case).
So it gets decoded twice, and the decoded value of the char causes the error. I'll consider changing the heuristic to make it less eager to decode.
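The double-decoding failure can be reproduced outside Pharo. The following is a minimal Python sketch, not XMLParser's actual code: the helper names are hypothetical, and the Latin-1 step is an assumption used to model a second decoder that reads each already-decoded character as if it were a raw byte.

```python
def decode_utf8(data: bytes) -> str:
    """Decode raw bytes as UTF-8 (what the file stream already did)."""
    return data.decode("utf-8")


def reinterpret_as_bytes(text: str) -> bytes:
    # Assumption: a second decoder sees code point N as byte N.
    # Latin-1 maps code points 0-255 to the same byte values.
    return text.encode("latin-1")


# The bytes on disk: U+00A0 is stored as 0xC2 0xA0.
raw = '<sms body=".\u00a0what" />'.encode("utf-8")

# First decode: 0xC2 0xA0 becomes the single character U+00A0.
once = decode_utf8(raw)

# Second decode: U+00A0 is now read as the lone byte 0xA0, which is a
# UTF-8 continuation byte with no lead byte, so decoding fails.
double_decode_failed = False
failure_offset = None
try:
    decode_utf8(reinterpret_as_bytes(once))
except UnicodeDecodeError as err:
    double_decode_failed = True
    failure_offset = err.start  # zero-based offset of the offending byte
    print("second decode failed at offset", failure_offset)
```

In this sketch the second decode fails at zero-based offset 12 of the sms line, which corresponds to column 13 in the reported error.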