
C - How to convert wide char Japanese characters to UTF-8?


I am trying to convert Japanese characters stored as wide characters (wchar_t) to UTF-8 so that I can store the value in a JSON file using the cJSON library. I first tried wcstombs_s, but apparently it does not support Japanese characters:

size_t len = wcslen(japanese[i].name) + 1;
char* japanese_char = malloc(len);
if (japanese_char == NULL) {
    exit(EXIT_FAILURE);
}
size_t sz;
wcstombs_s(&sz, japanese_char, len, japanese[i].name, _TRUNCATE);

Then, based on other answers and on my own successful conversion from JSON UTF-8 to wide char, I tried the opposite function as follows, but the destination buffer dest contains only garbage characters:

size_t wcsChars = wcslen(japanese[i].name);
size_t sizeRequired = WideCharToMultiByte(CP_UTF8, 0, japanese[i].name, wcsChars, NULL, 0, NULL, NULL);
char* dest = calloc(sizeRequired, 1);
WideCharToMultiByte(CP_UTF8, 0, japanese[i].name, wcsChars, dest, sizeRequired, NULL, NULL);
free(dest);

The wide-character string I am trying to convert is ササササササササササササササササ, stored in japanese[i].name (a wchar_t* in a struct). The objective is to use cJSON's cJSON_CreateString to save the value in a UTF-8 encoded JSON file.

Question: What is the proper way to convert Japanese from wchar_t to UTF-8 char in C (not C++)?


Solution

  • Your wcstombs_s() code is passing the wrong value to the sizeInBytes parameter:

    sizeInBytes

    The size in bytes of the mbstr buffer.

    You are passing in the character count of japanese[i].name, and you also size japanese_char from that count, rather than from the number of bytes the converted string needs. They are not the same value.

    Unicode code points are encoded in UTF-16 (the encoding of wchar_t strings on Windows) using 2 or 4 bytes each, and in UTF-8 using 1 to 4 bytes each, depending on their value. Code points in the U+0800..U+FFFF range, which includes Japanese kana and kanji, take 3 bytes in UTF-8 but only 2 in UTF-16, so your japanese_char buffer may need to be larger than your japanese[i].name data, and certainly larger than its character count. Note also that wcstombs_s() converts to the multibyte encoding of the current C locale, so its output is UTF-8 only if that locale uses UTF-8. Just as you can call WideCharToMultiByte() to determine the destination buffer size needed, you can do the same thing with wcstombs_s():

    size_t len = 0;
    wcstombs_s(&len, NULL, 0, japanese[i].name, _TRUNCATE);
    if (len == 0)
        exit(EXIT_FAILURE);
    char* japanese_char = malloc(len);
    if (!japanese_char)
        exit(EXIT_FAILURE);
    wcstombs_s(&len, japanese_char, len, japanese[i].name, _TRUNCATE);
    ...
    free(japanese_char);
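
To make the size difference concrete, here is a minimal sketch that computes how many bytes a single code point occupies in each encoding. The function names utf8_bytes and utf16_bytes are made up for this illustration; they are not part of the C runtime or the Windows API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes needed to encode one Unicode code point in
 * UTF-8. Returns 0 for values that are not encodable (surrogates, or
 * anything above U+10FFFF). */
size_t utf8_bytes(uint32_t cp)
{
    if (cp < 0x80)
        return 1;
    if (cp < 0x800)
        return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogate halves are not code points */
    if (cp < 0x10000)
        return 3;
    if (cp <= 0x10FFFF)
        return 4;
    return 0;
}

/* Hypothetical helper: bytes needed to encode the same code point in
 * UTF-16 (one 2-byte unit, or a 4-byte surrogate pair). */
size_t utf16_bytes(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp < 0x10000)
        return 2;
    if (cp <= 0x10FFFF)
        return 4;
    return 0;
}
```

For the katakana サ from the question (U+30B5), utf8_bytes() gives 3 and utf16_bytes() gives 2, so a string of n such characters needs 3n + 1 bytes in UTF-8, not the n + 1 bytes that wcslen() + 1 allocates.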
    

  • Your WideCharToMultiByte() code is not null-terminating dest because you are passing an explicit size to the cchWideChar parameter:

    cchWideChar

    Size, in characters, of the string indicated by lpWideCharStr. Alternatively, this parameter can be set to -1 if the string is null-terminated. If cchWideChar is set to 0, the function fails.

    If this parameter is -1, the function processes the entire input string, including the terminating null character. Therefore, the resulting character string has a terminating null character, and the length returned by the function includes this character.

    If this parameter is set to a positive integer, the function processes exactly the specified number of characters. If the provided size does not include a terminating null character, the resulting character string is not null-terminated, and the returned length does not include this character.

    cJSON_CreateString() expects a null-terminated char* string. So you need to either:

    • add +1 to the allocated size (the num parameter of calloc() in your code) to make room for the missing null terminator, and write the terminator yourself:
    size_t wcsChars = wcslen(japanese[i].name);
    size_t len = WideCharToMultiByte(CP_UTF8, 0, japanese[i].name, wcsChars, NULL, 0, NULL, NULL);
    char* japanese_char = malloc(len + 1);
    if (!japanese_char)
        exit(EXIT_FAILURE);
    WideCharToMultiByte(CP_UTF8, 0, japanese[i].name, wcsChars, japanese_char, len, NULL, NULL);
    japanese_char[len] = '\0';
    ...
    free(japanese_char);
    
    • add +1 to the return value of wcslen(), or set the cchWideChar parameter of WideCharToMultiByte() to -1, to include the null terminator in the output.
    size_t wcsChars = wcslen(japanese[i].name) + 1;
    size_t len = WideCharToMultiByte(CP_UTF8, 0, japanese[i].name, wcsChars, NULL, 0, NULL, NULL);
    if (len == 0)
        exit(EXIT_FAILURE);
    char* japanese_char = malloc(len);
    if (!japanese_char)
        exit(EXIT_FAILURE);
    WideCharToMultiByte(CP_UTF8, 0, japanese[i].name, wcsChars, japanese_char, len, NULL, NULL);
    ...
    free(japanese_char);
    
    size_t len = WideCharToMultiByte(CP_UTF8, 0, japanese[i].name, -1, NULL, 0, NULL, NULL);
    if (len == 0)
        exit(EXIT_FAILURE);
    char* japanese_char = malloc(len);
    if (!japanese_char)
        exit(EXIT_FAILURE);
    WideCharToMultiByte(CP_UTF8, 0, japanese[i].name, -1, japanese_char, len, NULL, NULL);
    ...
    free(japanese_char);
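
If you want to check the result independently of the Windows API, the conversion itself is small enough to write by hand. The following is only a sketch, not a replacement for WideCharToMultiByte(): utf16_to_utf8 is a hypothetical helper name, it assumes the input is a null-terminated array of 16-bit UTF-16 code units (which is what wchar_t holds on Windows), and it rejects unpaired surrogates instead of substituting a replacement character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: encodes a null-terminated UTF-16 string as a newly
 * allocated, null-terminated UTF-8 string. Returns NULL on allocation
 * failure or on an unpaired surrogate. The caller frees the result. */
char *utf16_to_utf8(const uint16_t *src)
{
    assert(src != NULL);

    size_t units = 0;
    while (src[units] != 0)
        units++;

    /* Worst case is 3 UTF-8 bytes per UTF-16 code unit
     * (a surrogate pair is 2 units but only 4 bytes). */
    unsigned char *out = malloc(units * 3 + 1);
    if (out == NULL)
        return NULL;

    size_t o = 0;
    for (size_t i = 0; i < units; i++) {
        uint32_t cp = src[i];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate.
             * Reading src[i + 1] is safe because src[units] is 0. */
            uint32_t lo = src[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF) {
                free(out);
                return NULL;
            }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            i++;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            /* Low surrogate without a preceding high surrogate. */
            free(out);
            return NULL;
        }

        if (cp < 0x80) {
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }

    out[o] = '\0';              /* always null-terminate for cJSON */
    return (char *)out;
}
```

Like the WideCharToMultiByte() variant that passes -1, the returned buffer is null-terminated, so it can be passed to cJSON_CreateString() (which makes its own copy) and then released with free().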