character-encoding, thunderbird

TextEncoder produces UTF-8 instead of request charset encoding


As part of transitioning my Thunderbird extension to Thunderbird 60, I need to switch from using nsIScriptableUnicodeConverter (if you don't know Mozilla, never mind what that is) to the more widely used, cross-browser TextDecoder and TextEncoder. The thing is, their behavior is not what I would expect.

Specifically, suppose I have the string str containing "ùìåí," (without the quotes of course). Now, when I run:

undecoded_str = new TextEncoder("windows-1252").encode(str);

I expect to be getting the sequence

F9, EC, E5, ED, 2C

(the 1-octet windows-1252 value for each of the 5 characters). But what I actually get is:

C3, B9, C3, AC, C3, A5, C3, AD, 2C

which seems to be the UTF-8 encoding of the string. Why is this happening?


Solution

  • Annoyingly, many browsers have simply dropped support for non-UTF-8 encodings in TextEncoder (though, as the note quoted here says, not in TextDecoder):

    Note: Firefox, Chrome and Opera used to have support for encoding types other than utf-8 (such as utf-16, iso-8859-2, koi8, cp1261, and gbk). As of Firefox 48 (ticket), Chrome 54 (ticket) and Opera 41, no other encoding types are available other than utf-8, in order to match the spec. In all cases, passing in an encoding type to the constructor will be ignored and a utf-8 TextEncoder will be created (the TextDecoder still allows for other decoding types).

    Damn it!
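
    If you still need windows-1252 bytes, one possible workaround is to do the mapping by hand. The following is only a minimal sketch: encodeWindows1252 and the two tables are hypothetical names of mine, not part of any browser or Thunderbird API, and replacing unmappable characters with "?" is an assumption you may want to change. The table for bytes 0x80-0x9F follows the windows-1252 index in the WHATWG Encoding Standard; every other byte maps to the code point of the same value.

    ```javascript
    // Hypothetical helper, not a browser or Thunderbird API.
    // Code points that windows-1252 assigns to bytes 0x80..0x9F.
    const CP1252_HIGH = [
      0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
      0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
      0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
      0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
    ];

    // Reverse lookup: Unicode code point -> windows-1252 byte.
    const CP1252_BYTE = new Map();
    for (let b = 0; b < 0x100; b++) {
      const cp = (b >= 0x80 && b <= 0x9F) ? CP1252_HIGH[b - 0x80] : b;
      CP1252_BYTE.set(cp, b);
    }

    function encodeWindows1252(str) {
      const out = [];
      for (const ch of str) { // for...of iterates by code point, not UTF-16 unit
        const b = CP1252_BYTE.get(ch.codePointAt(0));
        // Assumption: characters with no windows-1252 byte become "?" (0x3F).
        out.push(b === undefined ? 0x3F : b);
      }
      return Uint8Array.from(out);
    }

    // The string from the question, written with escapes to avoid any
    // dependence on the source file's own encoding.
    const bytes = encodeWindows1252("\u00F9\u00EC\u00E5\u00ED,");
    const hex = Array.from(bytes, (b) => b.toString(16).toUpperCase().padStart(2, "0"));
    console.log(hex.join(", ")); // F9, EC, E5, ED, 2C
    ```

    Decoding in the other direction needs no such helper, since TextDecoder still accepts "windows-1252" as a label.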