Flash CS4/AS3: differing behavior between console and textarea for printing UTF-16 characters

trace(escape("д"));

will print "%D0%B4", the correct URL encoding for this character (the Cyrillic equivalent of "D").

However, if I do this:

myTextArea.htmlText += unescape("%D0%B4");

What gets printed is:

Ð´

which is of course incorrect. Simply tracing the above unescape returns the correct Cyrillic character, though! For this textarea, escaping "д" returns its Unicode code point, "%u0434".

I'm not sure what exactly is happening to mess this up, but...

UTF-16 Ð´ in web encoding is: %FE%FF%00%D0%00%B4

Whereas

UTF-16 д in web encoding is: %00%D0%00%B4

So it's padding this value with something at the beginning. Why would a trace provide different text than a print to an (empty) textarea? What's goin' on?

The textarea in question has no weird encoding properties attached to it, if that sort of thing is even possible.

Solution

The problem is unescape (escape could also be a problem, but it's not the culprit in this case). These functions are not multibyte aware. What escape does is this: it takes a byte in the input string and returns its hex representation with a % prepended. unescape does the opposite. The key point here is that they work with bytes, not characters.

What you want is encodeURIComponent / decodeURIComponent. Both use utf-8 as the string encoding scheme (the encoding used by flash everywhere). Note that it's not utf-16 (which you shouldn't care about as far as flash is concerned).

encodeURIComponent("д"); //%D0%B4
decodeURIComponent("%D0%B4"); // д
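To see the two decoders side by side, here is a small sketch. It is written in TypeScript rather than AS3, on the assumption that the ECMAScript globals of the same names behave the same way for this input; `encoded`, `mangled`, `fixed` and `lengths` are just illustrative names.

```typescript
// The same percent-encoded input, decoded two ways.
const encoded: string = encodeURIComponent("д"); // UTF-8 aware: "%D0%B4"

// unescape maps each %XX to one character, so two bytes become two chars.
const mangled: string = unescape(encoded);

// decodeURIComponent reads the %XX run as UTF-8 bytes, so it yields one char.
const fixed: string = decodeURIComponent(encoded);

// The character counts make the difference visible: 2 versus 1.
const lengths: number[] = [mangled.length, fixed.length];
```

If the sketch carries over to Flash, `mangled` is the two-character string that shows up in the text area and `fixed` is the single Cyrillic character you wanted.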

Now, if you want to dig a bit deeper, here's what's going on (this assumes a basic knowledge of how utf-8 works).

escape("д")

This returns

%D0%B4

Why?

"д" is treated by flash as utf-8. The codepoint for this character is 0x0434.

In binary:

0000 0100 0011 0100

It fits in two utf-8 bytes, so it's encoded thus (where e means encoding bit, and p means payload bit):

1101 0000 1011 0100
eeep pppp eepp pppp

Converting it to hex, we get:

0xd0  0xb4

So, 0xd0,0xb4 is a utf-8 encoded "д".
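The bit shuffling above can be sketched in a few lines. This is TypeScript, not AS3, and it only handles the two-byte range; `encodeTwoByteUtf8` and `toHex` are hypothetical helper names made up for the illustration.

```typescript
// Encode a code point in the two-byte UTF-8 range (0x80..0x7FF) by hand.
function encodeTwoByteUtf8(codePoint: number): number[] {
  if (codePoint < 0x80 || codePoint > 0x7ff) {
    throw new RangeError("this sketch only handles the two-byte range");
  }
  const lead = 0xc0 | (codePoint >> 6);   // 110ppppp: top 5 payload bits
  const cont = 0x80 | (codePoint & 0x3f); // 10pppppp: low 6 payload bits
  return [lead, cont];
}

// Format each byte as uppercase hex for easy comparison with the text.
function toHex(bytes: number[]): string[] {
  return bytes.map((b) => b.toString(16).toUpperCase());
}

const deBytes: number[] = encodeTwoByteUtf8(0x0434); // code point of "д"
const deHex: string[] = toHex(deBytes);              // ["D0", "B4"]
```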

This is fed to escape. escape sees two bytes, and gives you:

%D0%B4

Now, you pass this to unescape. But unescape is a little bit brain-dead, so it thinks one byte is one and the same thing as one char, always. As far as unescape is concerned, you have two bytes, hence, you have two chars. If you look up the code-points for 0xd0 and 0xb4, you'll see this:

0xd0 -> Ð
0xb4 -> ´

So, unescape returns a string consisting of two chars, Ð and ´ (instead of figuring out that the two bytes it got were actually just one char, utf-8 encoded). Then, when you assign the text property, you are not really passing д but Ð´, and this is what you see in the text area.
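To make the difference concrete, here is a rough simulation of the two interpretations of the same bytes. It is a TypeScript sketch of the behaviour described above, not the actual Flash implementation; `percentToBytes`, `bytesAsChars` and `twoBytesAsUtf8` are hypothetical names.

```typescript
// Split a string made only of %XX groups into raw byte values.
function percentToBytes(s: string): number[] {
  return s.split("%").slice(1).map((hex) => parseInt(hex, 16));
}

// One char per byte: the byte-oriented reading attributed to unescape.
function bytesAsChars(bytes: number[]): string {
  return bytes.map((b) => String.fromCharCode(b)).join("");
}

// Read a lead byte plus one continuation byte as a single UTF-8 char.
function twoBytesAsUtf8(bytes: number[]): string {
  const codePoint = ((bytes[0] & 0x1f) << 6) | (bytes[1] & 0x3f);
  return String.fromCharCode(codePoint);
}

const rawBytes: number[] = percentToBytes("%D0%B4"); // [0xD0, 0xB4]
const wrongText: string = bytesAsChars(rawBytes);    // "Ð´" (U+00D0, U+00B4)
const rightText: string = twoBytesAsUtf8(rawBytes);  // "д" (U+0434)
```

Same two bytes, two different strings: the only thing that changes is whether the decoder knows the bytes are utf-8.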