I'm trying to test Unicode characters outside the BMP range. Below I use U+D834DF01 as an example character and try to convert it to a multibyte character, but the program fails and says 'Illegal byte sequence'. Why?
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <stdlib.h>
#include <limits.h>
int main(int argc, const char *argv[])
{
    setlocale(LC_ALL, ""); // my locale is UTF-8
    wchar_t wc = 0xd834df01;
    char bytes[MB_LEN_MAX] = {0};
    int r = wctomb(bytes, wc);
    if (r > 0) {
        for (int i = 0; i < MB_LEN_MAX; i++)
            printf("0x%x\n", bytes[i]);
    } else {
        perror("fail");
    }
    return 0;
}
Hex D834DF01 is not a valid Unicode code point; no value above hex 10FFFF is. That is why wctomb fails and sets errno to EILSEQ, which perror reports as 'Illegal byte sequence'. The pair (sequence of two) of 'surrogate' code units D834 and DF01 is the UTF-16 encoding of code point U+1D301, the digram for heavenly earth in Tai Xuan Jing, which is validly encodable in UTF-8 as f0 9d 8c 81. UTF-16 is used in much of Windows, almost all of Java, and some other places.
If wchar_t on your platform is 32 bits wide and holds code points directly (as it does with glibc, for example), store 0x1D301 in wc rather than the two surrogates concatenated.
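To make the arithmetic concrete, here is a minimal sketch of the two conversions involved: combining a UTF-16 surrogate pair into a code point, and encoding a code point as UTF-8 by hand. The function names combine_surrogates and utf8_encode are hypothetical helpers made up for this illustration, not library functions, and returning 0 on invalid input is just a convention chosen for the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Combine a UTF-16 surrogate pair into a code point.
 * Hypothetical helper: returns 0 if hi is not a high surrogate
 * (D800..DBFF) or lo is not a low surrogate (DC00..DFFF).
 * A valid pair always yields a value >= 0x10000, so 0 is unambiguous. */
uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0;
    return 0x10000u + (((uint32_t)hi - 0xD800u) << 10)
                    + ((uint32_t)lo - 0xDC00u);
}

/* Encode a code point as UTF-8 into out (room for 4 bytes).
 * Hypothetical helper: returns the number of bytes written, or 0 if
 * cp is above 0x10FFFF or is itself a surrogate (D800..DFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu))
        return 0;
    if (cp < 0x80u) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800u) {
        out[0] = (unsigned char)(0xC0u | (cp >> 6));
        out[1] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 2;
    }
    if (cp < 0x10000u) {
        out[0] = (unsigned char)(0xE0u | (cp >> 12));
        out[1] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
        out[2] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 3;
    }
    out[0] = (unsigned char)(0xF0u | (cp >> 18));
    out[1] = (unsigned char)(0x80u | ((cp >> 12) & 0x3Fu));
    out[2] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
    out[3] = (unsigned char)(0x80u | (cp & 0x3Fu));
    return 4;
}
```

Calling combine_surrogates(0xD834, 0xDF01) gives 0x1D301, and utf8_encode on that value writes the four bytes f0 9d 8c 81. Passing 0xD834DF01 to utf8_encode returns 0, which mirrors the range check that makes wctomb reject it.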