c++, c, gcc, encoding

Why is wcrtomb ASCII-only?


On my system, wcrtomb() appears to treat "narrow multibyte representation" as "ASCII-only", even when I compile with -fexec-charset=utf-8. I was under the impression that the -fexec-charset gcc flag controls the meaning of "narrow multibyte representation", and that wcrtomb converts from the "wide character set" to the "narrow multibyte representation". If the "narrow multibyte representation" is UTF-8 and the "wide character set" is UTF-32, then wcrtomb should convert from UTF-32 to UTF-8. I know the practical answer is probably to use an explicit UTF-32 to UTF-8 conversion instead of depending on these two notions, but I want to understand why this does not do what I expect.

#include <clocale>
#include <cstdlib>   // MB_CUR_MAX
#include <cwchar>
#include <iostream>
#include <string>
#include <vector>
#include <fstream>

int main() {
    wchar_t max = 0x10FFFF;
    // Room for the longest possible encoding of every code point.
    std::vector<char> out(MB_CUR_MAX * max);
    char *end = &out[0];
    for (wchar_t c = 0; c < max; ++c) {
        std::mbstate_t state{};
        std::size_t ret = std::wcrtomb(end, c, &state);
        // Keep only the characters that could be converted.
        if (ret != static_cast<std::size_t>(-1)) {
            end += ret;
        }
    }
    std::ofstream outfile("out", std::ios::out | std::ios::binary);
    outfile.write(&out[0], end - &out[0]);
    return 0;
}
(export LC_ALL=en_US.UTF-8; g++ -fwide-exec-charset=utf-32le -fexec-charset=utf-8 main.cpp && ./a.out && cat -v ./out && echo)
^@^A^B^C^D^E^F^G^H  
^K^L^M^N^O^P^Q^R^S^T^U^V^W^X^Y^Z^[^\^]^^^_ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~^?

What I tried:

  1. Setting -fexec-charset=utf-8 even though gcc documentation says this is the default
  2. Setting -fwide-exec-charset=utf-32le even though this appeared to already be the case
  3. Setting LC_ALL=en_US.UTF-8 both for compilation and for execution
  4. Compiling with clang instead of gcc (-fwide-exec-charset not supported, but printing __clang_wide_literal_encoding__ says UTF-32)

System info: Ubuntu 22.04.3 LTS; g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0; Ubuntu clang version 14.0.0-1ubuntu1.1


Solution

  • Why is wcrtomb ASCII-only?

    Because the locale in your program is "C". Every C or C++ program starts in the "C" locale, and on glibc that locale uses an ASCII-only character encoding. The conversion done by wcrtomb is locale dependent, so any character outside ASCII fails to convert. If you want to inherit the locale from the environment, call setlocale(LC_ALL, "") before converting. See the setlocale and locale.h documentation. The examples you linked to set the locale; your code does not.

    -fexec-charset gcc flag controls the meaning of "narrow multibyte representation"

    No. -fexec-charset selects the encoding the compiler uses when it converts string literals such as "π" in the source code into bytes in the compiled program. -fwide-exec-charset does the same for wide literals such as L"π". The C standard library functions, by contrast, choose the multibyte character encoding at run time according to the current locale.