Ruby 1.9.3
irb(main):036:0* "\x03".force_encoding("UTF-16")
=> "\x03"
irb(main):040:0* "\x03".force_encoding("UTF-8")
=> "\u0003"
Why does "\x03".force_encoding("UTF-8") give "\u0003" while "\x03".force_encoding("UTF-16") gives "\x03"? I thought it would be the other way round.
Because the single byte "\x03" is not a valid byte sequence in UTF-16, but it is a valid one in UTF-8 (ASCII 03, ETX, end of text). You have to use at least two bytes to represent a Unicode code point in UTF-16. That's why "\x03" can be treated as Unicode \u0003 in UTF-8 but not in UTF-16.
To represent "\u0003" in UTF-16, you have to use two bytes, either 00 03 or 03 00, depending on the byte order. That's why the byte order has to be specified in UTF-16, which is done with a leading byte order mark (FE FF or FF FE). For the big-endian version, the byte sequence should be
FE FF 00 03
For the little-endian version, the byte sequence should be
FF FE 03 00
The byte order mark should appear at the beginning of a string, or at the beginning of a file.
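The byte sequences above can be checked by asking Ruby to encode the code point. This is only a minimal sketch with illustrative variable names. Note that the big-endian BOM shown for the plain "UTF-16" name is what Ruby's encoder emits, not something UTF-16 itself mandates.

```ruby
# Encode U+0003 into the UTF-16 variants and look at the raw bytes.
etx = "\u0003"

# Explicit byte order in the encoding name, so no byte order mark is written.
be_bytes = etx.encode("UTF-16BE").bytes.to_a   # 00 03
le_bytes = etx.encode("UTF-16LE").bytes.to_a   # 03 00

# Plain "UTF-16": Ruby's encoder prepends a big-endian BOM (FE FF).
bom_bytes = etx.encode("UTF-16").bytes.to_a    # FE FF 00 03

# Small helper (illustrative) to print bytes as hex pairs.
hex = lambda { |bytes| bytes.map { |b| format("%02X", b) }.join(" ") }
puts hex.call(be_bytes)
puts hex.call(le_bytes)
puts hex.call(bom_bytes)
```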
Starting from Ruby 1.9, a String is just a byte sequence with a specific encoding as a tag. force_encoding is a method that changes the encoding tag; it does not affect the byte sequence. You can verify that by inspecting "\x03".force_encoding("UTF-8").bytes.
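A small sketch of that check, assuming UTF-16BE as a stand-in for the "UTF-16" tag used in the question (Ruby treats the bare UTF-16 name as a dummy encoding, so the explicit byte-order variant is easier to reason about). The variable names are illustrative.

```ruby
# force_encoding only swaps the encoding tag; the bytes stay the same.
raw = "\x03".dup   # work on an unfrozen copy of the literal

as_utf8  = raw.dup.force_encoding("UTF-8")
as_utf16 = raw.dup.force_encoding("UTF-16BE")

p as_utf8.bytes.to_a         # the same single byte in both cases
p as_utf16.bytes.to_a

p as_utf8.valid_encoding?    # 03 is a complete UTF-8 sequence
p as_utf16.valid_encoding?   # one byte cannot form a 16-bit code unit
```

The tag changes how the byte is interpreted, which is why inspect prints "\u0003" for one string and "\x03" for the other even though the underlying data is identical.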
If you see "\u0003", that doesn't mean you got a String that is represented by the two bytes 00 03, but some byte(s) that represent the Unicode code point 0003 under the specific encoding carried by that String. It may be:
03 // tagged as UTF-8
FE FF 00 03 // tagged as UTF-16
FF FE 03 00 // tagged as UTF-16
03 // tagged as GBK
03 // tagged as ASCII
00 00 FE FF 00 00 00 03 // tagged as UTF-32
FF FE 00 00 03 00 00 00 // tagged as UTF-32
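To see that these really are the same code point, each byte sequence can be tagged with its encoding and transcoded to UTF-8. This is a rough sketch: the samples array and variable names are illustrative, and it relies on Ruby's UTF-16 and UTF-32 converters consuming the leading byte order mark.

```ruby
# Each entry: raw bytes plus the encoding tag they are meant to carry.
samples = [
  [[0x03],                   "UTF-8"],
  [[0xFE, 0xFF, 0x00, 0x03], "UTF-16"],
  [[0xFF, 0xFE, 0x03, 0x00], "UTF-16"],
  [[0x03],                   "GBK"],
  [[0x03],                   "US-ASCII"],
  [[0x00, 0x00, 0xFE, 0xFF, 0x00, 0x00, 0x00, 0x03], "UTF-32"],
  [[0xFF, 0xFE, 0x00, 0x00, 0x03, 0x00, 0x00, 0x00], "UTF-32"]
]

# Build the byte string, attach the tag, then convert to UTF-8.
decoded = samples.map do |bytes, tag|
  bytes.pack("C*").force_encoding(tag).encode("UTF-8")
end

# Print the code points found in each decoded string.
decoded.each { |s| p s.codepoints.to_a }
```

Every entry is expected to decode to the single code point 3, even though the stored bytes differ from one encoding to the next.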