Converting UTF-16 to UTF-8 using libiconv

I'm trying to convert a UTF-16 string to UTF-8 and have hit a small wall. The output string contains the characters, but with gaps between them. The input is hi\0, but when I look at the output it says h\0i\0 instead of hi\0.

Do you see the problem here? Many thanks!

size_t len16 = 3 * sizeof(wchar_t);
size_t len8 = 7;
wchar_t utf16[3] = { 0x0068, 0x0069, 0x0000 }, *_utf16 = utf16;
char utf8[7], *_utf8 = utf8;

iconv_t utf16_to_utf8 = iconv_open("UTF-8", "UTF-16LE");
size_t result = iconv(utf16_to_utf8, (char **)&_utf16, &len16, &_utf8, &len8);

printf("%d - %s\n", (int)result, utf8);

iconv_close(utf16_to_utf8);

Solution

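The input data for iconv is always an opaque byte stream. When reading UTF-16, iconv expects the input data to consist of two-byte code units. Therefore, if you want to provide hard-coded input data, you need to use a two-byte wide integral type. wchar_t is not such a type everywhere: on most Unix-like systems it is four bytes wide, so each of your characters is followed by two extra zero bytes, which iconv decodes as U+0000.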
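To see what iconv actually receives, it can help to dump the raw bytes of the array. The following is only a minimal sketch; hex_bytes is a hypothetical helper name introduced here for illustration and is not part of iconv.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not part of iconv): formats the n bytes starting
 * at p as space-separated hex pairs into out, stopping early if out is
 * too small. Returns the number of bytes that were formatted. */
size_t hex_bytes(const void *p, size_t n, char *out, size_t out_size)
{
    assert(p != NULL && out != NULL);

    const unsigned char *bytes = p;
    size_t done = 0;

    if (out_size > 0)
        out[0] = '\0';

    /* Each byte needs three characters ("xx ") plus the final NUL. */
    while (done < n && 3 * (done + 1) + 1 <= out_size) {
        snprintf(out + 3 * done, out_size - 3 * done, "%02x ",
                 (unsigned)bytes[done]);
        ++done;
    }
    return done;
}
```

Calling hex_bytes(utf16, sizeof utf16, buf, sizeof buf) on the wchar_t array from the question should, on a little-endian system where wchar_t is four bytes wide (typical for Linux), show 68 00 00 00 69 00 00 00 and so on, whereas the same values stored in a uint16_t array occupy only two bytes each.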

In C++11 and C11 this should be char16_t, but you can also use uint16_t:

uint16_t data[] = { 0x68, 0x69, 0 };

char const * p = (char const *)data;

To be pedantic, nothing in the C standard itself says that uint16_t occupies two bytes, only that it has exactly 16 bits. However, iconv is a POSIX library, and POSIX mandates that CHAR_BIT == 8, so on POSIX systems uint16_t is always two bytes wide.
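Putting this together, a corrected conversion might look roughly like the sketch below. The function name utf16_to_utf8 and its signature are assumptions made for this example, as is the choice to detect the host byte order at run time so that the matching encoding name ("UTF-16LE" or "UTF-16BE") is passed to iconv_open. It also assumes the glibc/POSIX prototype of iconv, which takes a char ** input pointer.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts `units` UTF-16 code units (stored in host
 * byte order) to NUL-terminated UTF-8 in dst. Returns the number of UTF-8
 * bytes written, not counting the terminator, or (size_t)-1 on failure. */
size_t utf16_to_utf8(const uint16_t *src, size_t units,
                     char *dst, size_t dst_size)
{
    assert(src != NULL && dst != NULL && dst_size > 0);

    /* Pick the encoding name that matches how this machine stores uint16_t. */
    const uint16_t probe = 1;
    const char *from =
        (*(const unsigned char *)&probe == 1) ? "UTF-16LE" : "UTF-16BE";

    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* iconv does not modify the input, but its prototype is not const. */
    char *in = (char *)src;
    size_t in_left = units * sizeof *src;   /* length in bytes, not units */
    char *out = dst;
    size_t out_left = dst_size - 1;         /* keep room for the NUL */

    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    *out = '\0';
    return (size_t)(out - dst);
}
```

With uint16_t data[] = { 0x68, 0x69 }; and a char buffer of at least three bytes, utf16_to_utf8(data, 2, buf, sizeof buf) should leave the string hi in buf. The terminating zero is left out of the input on purpose here, since iconv would otherwise convert it like any other code unit.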

(Also note that the way you spell a literal value has nothing to do with the width of the type which you initialize with that value, so there's no difference between 0x68, 0x0068, or 0x00068. What's much more interesting are the universal character names \u and \U and the new u"..." and U"..." Unicode literals, but that's a whole different story.)