trace(escape("д"));
will print "%D0%B4", the correct URL encoding for this character (the Cyrillic equivalent of "D").
However, if I were to do:
myTextArea.htmlText += unescape("%D0%B4");
What gets printed is:
Ð´
which is of course incorrect. Simply tracing the above unescape returns the correct Cyrillic character, though! For this textarea, escaping "д" returns its Unicode code point, "%u0434".
I'm not sure what exactly is happening to mess this up, but...
UTF-16 д in web encoding is: %FE%FF%00%D0%00%B4
Whereas
UTF-16 д in web encoding is: %00%D0%00%B4
So it's padding this value with something at the beginning. Why would a trace provide different text than a print to an (empty) textarea? What's goin' on?
The textarea in question has no weird encoding properties attached to it, if that sort of thing is even possible.
The problem is unescape (escape could also be a problem, but it's not the culprit in this case). These functions are not multibyte aware. What escape does is this: it takes a byte in the input string and returns its hex representation with a % prepended. unescape does the opposite. The key point here is that they work with bytes, not characters.
What you want is encodeURIComponent / decodeURIComponent. Both use UTF-8 as the string encoding scheme (the encoding used by Flash everywhere). Note that it's not UTF-16 (which you shouldn't care about as far as Flash is concerned).
encodeURIComponent("д"); //%D0%B4
decodeURIComponent("%D0%B4"); // д
Now, if you want to dig a bit deeper, here's what's going on (this assumes a basic knowledge of how utf-8 works).
escape("д")
This returns
%D0%B4
Why?
"д" is treated by flash as utf-8. The codepoint for this character is 0x0434.
In binary:
0000 0100 0011 0100
It fits in two UTF-8 bytes, so it's encoded thus (where e means encoding bit, and p means payload bit):
1101 0000 1011 0100
eeep pppp eepp pppp
Converting it to hex, we get:
0xd0 0xb4
So, 0xd0,0xb4 is a utf-8 encoded "д".
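The bit layout above can be checked with a few lines of code. This is a minimal sketch in TypeScript rather than ActionScript, and encodeTwoByteUtf8 is a hypothetical helper written for illustration, not a Flash API. It only handles code points that fit in two UTF-8 bytes.

```typescript
// Encode a code point from the two-byte UTF-8 range (U+0080..U+07FF).
// encodeTwoByteUtf8 is a hypothetical helper, not part of any Flash API.
function encodeTwoByteUtf8(codePoint: number): number[] {
  if (codePoint < 0x80 || codePoint > 0x7ff) {
    throw new RangeError("code point is outside the two-byte UTF-8 range");
  }
  // Lead byte: encoding bits 110, then the top 5 payload bits.
  const lead = 0xc0 | (codePoint >> 6);
  // Continuation byte: encoding bits 10, then the low 6 payload bits.
  const continuation = 0x80 | (codePoint & 0x3f);
  return [lead, continuation];
}

const utf8Bytes = encodeTwoByteUtf8(0x0434);
const utf8Hex = utf8Bytes.map((b) => b.toString(16)).join(" ");
console.log(utf8Hex); // d0 b4
```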
This is fed to escape. escape sees two bytes, and gives you:
%d0%b4
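As a rough model of that step, the sketch below percent-encodes a list of bytes one at a time. It is written in TypeScript, percentEncodeBytes is a hypothetical name, and it is deliberately simplified: it encodes every byte, whereas a real escape function leaves plain ASCII letters and digits untouched.

```typescript
// Simplified byte-wise model of escape: each byte becomes "%" plus two hex digits.
// percentEncodeBytes is a hypothetical helper; it encodes every byte it is given.
function percentEncodeBytes(bytes: number[]): string {
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    const hex = bytes[i].toString(16);
    // Pad single-digit values so every byte yields exactly two hex digits.
    out += "%" + (hex.length < 2 ? "0" + hex : hex);
  }
  return out;
}

const escaped = percentEncodeBytes([0xd0, 0xb4]);
console.log(escaped); // %d0%b4
```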
Now, you pass this to unescape. But unescape is a little bit brain-dead, so it thinks one byte is one and the same thing as one char, always. As far as unescape is concerned, you have two bytes, hence, you have two chars. If you look up the code-points for 0xd0 and 0xb4, you'll see this:
0xd0 -> Ð
0xb4 -> ´
So, unescape returns a string consisting of two chars, Ð and ´ (instead of figuring out that the two bytes it got were actually just one char, UTF-8 encoded). Then, when you assign the text property, you are not really passing д but Ð´, and this is what you see in the text area.
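The difference between the two decoders can be reproduced outside Flash. The sketch below is TypeScript, not ActionScript; byteWiseUnescape is a hypothetical stand-in that models the one-byte-equals-one-char behaviour described above, while decodeURIComponent is the standard ECMAScript function of the same name.

```typescript
// Byte-wise model of unescape: every %XX turns into one char with code XX.
// byteWiseUnescape is a hypothetical helper written to illustrate the bug.
function byteWiseUnescape(input: string): string {
  return input.replace(
    /%([0-9a-fA-F]{2})/g,
    (_match: string, hex: string) => String.fromCharCode(parseInt(hex, 16))
  );
}

const garbled = byteWiseUnescape("%D0%B4");
console.log(garbled.length);                // 2
console.log(garbled === "\u00d0\u00b4");    // true (the two chars Ð and ´)

// The UTF-8 aware decoder combines both bytes into a single character.
const decoded = decodeURIComponent("%D0%B4");
console.log(decoded.length);                // 1
console.log(decoded === "\u0434");          // true (the single char д)
```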