c linux unicode locale glibc

wctomb fail: Illegal byte sequence


I'm trying to test Unicode characters outside the BMP. Below I use U+D834DF01 as an example character and try to convert it to a multibyte character, but the program fails and prints 'Illegal byte sequence'. Why?

#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <stdlib.h>
#include <limits.h>

int main(int argc, const char *argv[])
{
    setlocale(LC_ALL, ""); // my locale is UTF-8

    wchar_t wc = 0xd834df01;
    char bytes[MB_LEN_MAX] = {0};
    int r = wctomb(bytes, wc);
    if (r > 0) {
        for (int i = 0; i < MB_LEN_MAX; i++)
            printf("0x%x\n", bytes[i]);
    } else {
        perror("fail");
    }

    return 0;
}

Solution

  • Hex D834DF01 is not a valid Unicode code point; no value above hex 10FFFF is. On Linux with glibc, wchar_t holds a single 32-bit code point, so wctomb rejects this value and fails with EILSEQ ('Illegal byte sequence'). D834 and DF01 are instead a pair of UTF-16 'surrogate' code units that together encode code point U+1D301, the digram for heavenly earth in Tai Xuan Jing, whose UTF-8 encoding is f0 9d 8c 81. Assigning 0x1D301 to wc instead lets the conversion succeed. UTF-16 is used in much of Windows, almost all of Java, and some other places.

    Correction: I originally did the surrogate conversion in my head and slipped a hex digit, giving U+10D301 (a private-use code point, f4 8d 8c 81 in UTF-8); as commented, it's actually U+1D301.
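The surrogate arithmetic and the conversion can be sketched in C as follows. This is a minimal sketch, not the answerer's code: the helper names combine_surrogates and encode_code_point are hypothetical, and it assumes a platform such as glibc where wchar_t is 32 bits wide and the locale has already been set to a UTF-8 one with setlocale.

```c
#include <limits.h>
#include <locale.h>
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
 * Returns 0 if hi/lo are not a valid high/low surrogate pair. */
uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0;
    /* Each surrogate carries 10 payload bits; the result is offset by 0x10000. */
    return 0x10000u + (((uint32_t)hi - 0xD800u) << 10) + ((uint32_t)lo - 0xDC00u);
}

/* Hypothetical helper: encode one code point in the current locale's
 * multibyte encoding. Assumes wchar_t is 32 bits wide (true on glibc).
 * Returns the number of bytes written to out, or -1 on failure. */
int encode_code_point(uint32_t cp, char out[MB_LEN_MAX])
{
    (void)wctomb(NULL, 0);            /* reset any conversion shift state */
    return wctomb(out, (wchar_t)cp);
}
```

With these, combine_surrogates(0xD834, 0xDF01) yields 0x1D301, and after setlocale(LC_ALL, "") in a UTF-8 locale, encode_code_point(0x1D301, bytes) should return 4 with the bytes f0 9d 8c 81. When printing the result, loop over the returned length rather than MB_LEN_MAX, and cast each byte to unsigned char so that bytes of 0x80 and above are not sign-extended.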