
How to convert from UTF-16 array to UTF-8 string?


I have a situation where I receive UTF-16 code units one at a time, so I collect them in a list and later convert the list to an array.

That leaves me with a uint16[], but GLib.convert () needs a string instead:

int main () {
    var utf16data = new Gee.ArrayList<uint16> ();

    utf16data.add ('A');
    utf16data.add (0xD83C);
    utf16data.add (0xDC1C);

    var utf16array = utf16data.to_array ();

    try {
        // convert expects a string here
        var s = convert (utf16array, utf16data.size * 2, "UTF-8", "UTF-16LE");
        stdout.printf ("%s\n", s);
    } 
    catch (ConvertError e) {
        stderr.printf (@"error: $(e.message)\n");
    }

    return 0;
}

So how do I convert a UTF-16 array into a UTF-8 string?

Update:

I tried to just cast the array:

int main () {
    var utf16data = new Gee.ArrayList<uint16> ();

    utf16data.add ('A');
    utf16data.add (0xD83C);
    utf16data.add (0xDC1C);
    // utf16data.add (0);

    var utf16array = utf16data.to_array ();

    try {
        size_t bytes_read;
        size_t bytes_written;
        var s = convert ((string) utf16array, utf16data.size * 2, "UTF-8", "UTF-16LE", out bytes_read, out bytes_written);
        stdout.puts (@"bytes_read = $bytes_read\n");
        stdout.puts (@"bytes_written = $bytes_written\n");
        stdout.puts (@"s.length = $(s.length)\n");
        // Should print "A🀜", but the Unicode symbol is not printed
        stdout.puts (@"s = $s\n");
    } 
    catch (ConvertError e) {
        stderr.printf (@"error: $(e.message)\n");
    }

    return 0;
}

Now at least the "A" is written to stdout, but the Unicode symbol is not.

bytes_read = 6
bytes_written = 3
s.length = 1
s = A

Is it correct to just cast an array to a string in this context?

Why is the Unicode symbol not converted?

Update 2:

This is the code that I have now settled on:

int main () {
    var utf16data = new Gee.ArrayList<uint16> ();

    utf16data.add ('A');
    utf16data.add (0xD83C);
    utf16data.add (0xDC1C);

    // Replacement for
    // var utf16array = utf16data.to_array ();
    uint16[] utf16array = new uint16[utf16data.size];
    for (int i = 0; i < utf16data.size; i++)
        utf16array[i] = utf16data[i];

    try {
        var s = convert ((string)utf16array, utf16array.length * 2, "UTF-8", "UTF-16LE");
        stdout.puts (@"$s\n");
    } 
    catch (ConvertError e) {
        stderr.puts (@"error: $(e.message)\n");
    }

    return 0;
}

Solution

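  • The problem is with to_array (). It does not produce an array of uint16 values but an array of pointers, each holding a uint16 value. This is the standard boxed representation. The cast to string therefore hands convert () the wrong bytes: if each element occupies 8 bytes, as on a 64-bit little-endian machine, the first 6 bytes are 41 00 00 00 00 00, which decode as "A" followed by two NUL characters. That would be consistent with bytes_written = 3 and s.length = 1 in the question's output. This looks like a problem in Gee, which isn't producing an array of the correct type. If you change the array to: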

    uint16[] utf16array = {'A', 0xD83C, 0xDC1C};
    

    It works just fine.
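
As an alternative that avoids both to_array () and the cast to string, the code units can be decoded by hand. The following is only a sketch: utf16_to_utf8 is a hypothetical helper, not a GLib or Gee function, and it assumes that replacing an unpaired surrogate with U+FFFD is acceptable for your data.

```vala
// Hypothetical helper (not part of GLib or Gee): decodes UTF-16 code units
// into a UTF-8 string, combining surrogate pairs by hand.
string utf16_to_utf8 (uint16[] units) {
    var sb = new StringBuilder ();

    for (int i = 0; i < units.length; i++) {
        uint32 c = units[i];

        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < units.length
            && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            // High surrogate followed by a low surrogate:
            // 0xD83C 0xDC1C -> 0x10000 + (0x3C << 10) + 0x1C = U+1F01C
            uint32 lo = units[i + 1];
            c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
            i++;
        } else if (c >= 0xD800 && c <= 0xDFFF) {
            // Assumption: an unpaired surrogate becomes U+FFFD
            c = 0xFFFD;
        }

        // append_unichar () writes the code point as UTF-8
        sb.append_unichar ((unichar) c);
    }

    return sb.str;
}

int main () {
    uint16[] utf16array = { 'A', 0xD83C, 0xDC1C };
    stdout.puts (utf16_to_utf8 (utf16array) + "\n");
    return 0;
}
```

The same loop could read straight from the Gee.ArrayList<uint16> with utf16data[i], as the copy loop in the question's second update does, so no intermediate array is needed.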