ruby-on-rails ruby encoding utf-8 utf-16

Ruby 1.9.3: Why does "\x03".force_encoding("UTF-8") give "\u0003", but "\x03".force_encoding("UTF-16") give "\x03"?


Ruby 1.9.3

irb(main):036:0* "\x03".force_encoding("UTF-16")
=> "\x03"
irb(main):040:0* "\x03".force_encoding("UTF-8")
=> "\u0003"

Why does "\x03".force_encoding("UTF-8") give "\u0003" while "\x03".force_encoding("UTF-16") ends up as "\x03"? I thought it should be the other way round.


Solution

  • Because the single byte "\x03" is not a valid byte sequence in UTF-16, but it is a valid one in UTF-8 (ASCII 03, ETX, end of text). UTF-16 needs at least two bytes to represent a Unicode code point.

    That's why "\x03" can be treated as Unicode \u0003 in UTF-8 but not in UTF-16.
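
    The validity difference can be checked directly. This is only a sketch: it uses the explicit UTF-16LE and UTF-16BE encodings rather than the plain UTF-16 label from the question (an assumption, to keep the byte order unambiguous), and the variable names are illustrative.

```ruby
# The same single byte, tagged with different encodings.
# .dup guards against the literal being frozen, since force_encoding mutates.
as_utf8    = "\x03".dup.force_encoding("UTF-8")
as_utf16le = "\x03".dup.force_encoding("UTF-16LE")
as_utf16be = "\x03".dup.force_encoding("UTF-16BE")

utf8_valid    = as_utf8.valid_encoding?     # one byte is a complete UTF-8 character
utf16le_valid = as_utf16le.valid_encoding?  # one byte is only half of a UTF-16 code unit
utf16be_valid = as_utf16be.valid_encoding?

puts "UTF-8 valid?    #{utf8_valid}"
puts "UTF-16LE valid? #{utf16le_valid}"
puts "UTF-16BE valid? #{utf16be_valid}"
```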

    To represent "\u0003" in UTF-16, you have to use two bytes, either 00 03 or 03 00, depending on the byte order. That's why the byte order has to be specified in UTF-16, typically with a leading byte order mark. For the big-endian version, the byte sequence should be

    FE FF 00 03
    

    For the little-endian version, the byte sequence should be

    FF FE 03 00
    

    The byte order mark (the leading FE FF or FF FE in the sequences above) should appear at the beginning of a string, or at the beginning of a file.
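
    The byte sequences above can be reproduced by transcoding. In this sketch, hex_bytes is a hypothetical helper, and the byte order mark is written explicitly as the code point U+FEFF before encoding to UTF-16BE or UTF-16LE (an assumption, so the result does not depend on how a given Ruby version handles the plain UTF-16 label).

```ruby
# Hypothetical helper: render a string's bytes as uppercase hex pairs.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

etx = "\u0003"  # U+0003 (ETX)
bom = "\uFEFF"  # U+FEFF, the byte order mark

be_plain = hex_bytes(etx.encode("UTF-16BE"))          # "00 03"
le_plain = hex_bytes(etx.encode("UTF-16LE"))          # "03 00"
be_bom   = hex_bytes((bom + etx).encode("UTF-16BE"))  # "FE FF 00 03"
le_bom   = hex_bytes((bom + etx).encode("UTF-16LE"))  # "FF FE 03 00"

puts be_plain, le_plain, be_bom, le_bom
```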

    Starting from Ruby 1.9, a String is just a byte sequence tagged with a specific encoding. force_encoding only changes the encoding tag; it does not affect the byte sequence. You can verify that by inspecting "\x03".force_encoding("UTF-8").bytes.
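
    For contrast, here is a small sketch of retagging versus transcoding. It assumes the source file uses the default UTF-8 encoding, and the variable names are illustrative: force_encoding leaves the bytes alone, while encode produces new bytes for the target encoding.

```ruby
tagged = "\x03".dup                    # .dup in case string literals are frozen
bytes_before = tagged.bytes.to_a

tagged.force_encoding("US-ASCII")      # retag only; no bytes are rewritten
bytes_after_retag = tagged.bytes.to_a
tag_after_retag   = tagged.encoding.name

transcoded = "\x03".encode("UTF-16LE") # real conversion from UTF-8 to UTF-16LE
bytes_after_encode = transcoded.bytes.to_a

puts "before:       #{bytes_before.inspect}"
puts "after retag:  #{bytes_after_retag.inspect} (#{tag_after_retag})"
puts "after encode: #{bytes_after_encode.inspect}"
```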

    If you see "\u0003", that doesn't mean you have a String represented by the two bytes 00 03. It means the String holds some byte(s) that represent the Unicode code point U+0003 under the encoding that String is tagged with. It may be:

    03                      // tagged as UTF-8
    FE FF 00 03             // tagged as UTF-16
    FF FE 03 00             // tagged as UTF-16
    03                      // tagged as GBK
    03                      // tagged as ASCII
    00 00 FE FF 00 00 00 03 // tagged as UTF-32
    FF FE 00 00 03 00 00 00 // tagged as UTF-32
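
    The list above can be checked by tagging each byte sequence and asking Ruby for its code points. This sketch makes one simplifying assumption: it drops the byte order marks and uses the explicit-endian encodings (UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE) so that each string holds exactly one character.

```ruby
# Byte sequences (without BOM) that should all decode to U+0003.
candidates = {
  "UTF-8"    => [0x03],
  "UTF-16BE" => [0x00, 0x03],
  "UTF-16LE" => [0x03, 0x00],
  "GBK"      => [0x03],
  "US-ASCII" => [0x03],
  "UTF-32BE" => [0x00, 0x00, 0x00, 0x03],
  "UTF-32LE" => [0x03, 0x00, 0x00, 0x00]
}

code_points = candidates.map do |name, bytes|
  # pack builds a raw binary string; force_encoding then only sets the tag.
  str = bytes.pack("C*").force_encoding(name)
  [name, str.codepoints.to_a]
end

code_points.each do |name, cps|
  puts format("%-8s -> %s", name, cps.inspect)
end
```

    Every entry reports the same code point, 3, even though the underlying bytes differ.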