
How to convert array of UCS-2 bytes to UTF-8 string in Ruby?


I have an array of UCS-2LE encoded bytes in Ruby. I'm a complete beginner with Ruby and I'm struggling to convert the array to a UTF-8 string. I have the same code working fine in PHP and Java.

In PHP I'm using the iconv library, but iconv has been deprecated in Ruby:

$str = iconv('UCS-2LE', 'UTF-8//IGNORE', implode($byte_array));

In Java I'm using:

str = new String(byte_array, "UTF-16LE");

Each character in the array is encoded as 2 bytes. How can I perform a similar conversion in Ruby? I've tried a few solutions, but none of them worked for me. Thank you.


Solution

  • Assuming a byte array:

    byte_array = [70, 0, 111, 0, 111, 0]
    

    You can use Array#pack to convert the integer values to characters (the C directive treats each integer as an 8-bit unsigned char):

    string = byte_array.pack("C*")       #=> "F\x00o\x00o\x00"
    

    Here, pack returns a string with ASCII-8BIT (binary) encoding:

    string.encoding                      #=> #<Encoding:ASCII-8BIT>
    

    You can now use String#force_encoding to reinterpret the bytes as a UTF-16 string (this changes the encoding of string in place):

    string.force_encoding("UTF-16LE")    #=> "Foo"
    

    The bytes haven't changed so far, only the encoding attached to them:

    string.bytes                         #=> [70, 0, 111, 0, 111, 0]
    

    To transcode the string into another encoding, use String#encode:

    utf8_string = string.encode("UTF-8") #=> "Foo"
    utf8_string.bytes                    #=> [70, 111, 111]
    

    The whole conversion can be written in a single line:

    byte_array.pack("C*").force_encoding("UTF-16LE").encode("UTF-8")
    

    or, skipping force_encoding, by passing the source encoding as the second argument to encode:

    byte_array.pack("C*").encode("UTF-8", "UTF-16LE")
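
The PHP snippet in the question uses //IGNORE to skip bytes that cannot be converted, while the one-liners above raise an exception on malformed input. As a rough sketch of a closer equivalent, the conversion can be wrapped in a small helper that passes the invalid: and replace: options to String#encode. The method name ucs2le_bytes_to_utf8 and the sample arrays are made up for illustration and are not part of the original answer:

```ruby
# Hypothetical helper (name is an assumption, not from the answer above):
# converts an array of UCS-2LE / UTF-16LE byte values to a UTF-8 string.
def ucs2le_bytes_to_utf8(byte_array)
  byte_array
    .pack("C*")                   # integers -> binary string
    .encode("UTF-8", "UTF-16LE",  # transcode from UTF-16LE to UTF-8
            invalid: :replace,    # do not raise on malformed sequences...
            replace: "")          # ...drop them instead, similar to //IGNORE
end

ascii  = ucs2le_bytes_to_utf8([70, 0, 111, 0, 111, 0])
euro   = ucs2le_bytes_to_utf8([0xAC, 0x20])  # U+20AC EURO SIGN in UTF-16LE
broken = ucs2le_bytes_to_utf8([70, 0, 111])  # final code unit is truncated

puts ascii               # prints Foo
puts euro.bytes.inspect  # prints [226, 130, 172]
puts broken.valid_encoding?
```

The euro sign example shows that encode really transcodes: two UTF-16LE bytes become three UTF-8 bytes. If silently dropping bad input is not acceptable, leave out the two options and rescue the resulting error instead.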