Tags: c, gcc, unicode, libc

How can I ensure gcc + libc has UTF-8 for multibyte strings and UTF-32 for wchar_t?


I want to know how to force a GCC + GNU libc toolchain into normal Unicode behaviour, where the source files are encoded in UTF-8, and where the compiled program uses UTF-8 as its multibyte character set and UTF-32LE as its wchar_t encoding, regardless of any locale information.

And I want to be able to know at compile time that it is going to work.

I know the normal answer is to use setlocale(LC_ALL, "en_US.utf8"), but it seems you can only find out at runtime whether setlocale(LC_ALL, "en_US.utf8") is going to work, since only the "C" and "POSIX" locales are guaranteed to exist and, unless I'm missing something, you can't compile a locale into your executable.
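
As far as I can tell, the best I can do is a runtime check like the following (just a sketch; "en_US.utf8" is an example of a locale name that may or may not be installed on the target system):

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Only the "C" and "POSIX" locales are guaranteed to exist,
         * so this call can fail on any given target system. */
        if (setlocale(LC_ALL, "en_US.utf8") == NULL) {
            fprintf(stderr, "UTF-8 locale not available\n");
            return EXIT_FAILURE;
        }
        /* Only here, at runtime, do I know the multibyte encoding is UTF-8. */
        return EXIT_SUCCESS;
    }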

GCC has these flags -finput-charset=utf-8 -fexec-charset=utf-8 -fwide-exec-charset=utf-32le, but it is unclear how they interact with setlocale(). If I use them, do I still need to call setlocale()? Are they overridden by setlocale()?
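
For concreteness, this is the kind of thing I mean (a sketch; the file name and literals are made up):

    /* hello.c -- source file saved as UTF-8 */
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        const char    *mb = "héllo";   /* I want this stored as UTF-8 bytes       */
        const wchar_t *ws = L"héllo";  /* I want this stored as UTF-32 code units */

        printf("narrow literal: %zu bytes\n", strlen(mb));      /* 6 if UTF-8  */
        printf("wide literal:   %zu code units\n", wcslen(ws)); /* 5 if UTF-32 */
        return 0;
    }

    $ gcc -finput-charset=utf-8 -fexec-charset=utf-8 \
          -fwide-exec-charset=utf-32le -o hello hello.c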

It seems like there should be some reliable way to force gcc + libc into normal Unicode behaviour without having to know what locales are preinstalled on the source or target systems.


Solution

  • This is not possible, and you don't want it anyway.

    The interfaces defined by locale.h and wchar.h are a decade older than Unicode, and their data model is built around these assumptions:

    1. There are many character sets and encodings, and none of them can necessarily represent all the characters your program might need to be able to handle over its lifetime.
    2. However, any single use of your program will only need to process text from one language, and in one encoding.
    3. Any one installation of the operating system will only need to process text in a small number of languages, knowable at installation time.

    All three of these assumptions are invalid nowadays. Instead we have:

    1. There is a single character set (Unicode) whose design goal is to represent all of the world's living written languages (how close we come to achieving that goal depends on who you talk to and how seriously you take Weinreich's Maxim).
    2. There are only a few encodings of all of Unicode to worry about, but data in 8-bit encodings that map to a subset of Unicode is still commonly encountered, and there are dozens of these.
    3. It is normal for a single run of a program to need to process text in multiple languages and in many different encodings. You can usually assume that a single file is all in one encoding, but not that you won't be called upon to merge data from sources in UTF-8, ISO-8859-2, and KOI8-R (for example).
    4. The whole concept of an "installation" (one corporation, one sysadmin, a handful of shared minicomputers, tens or hundreds of lusers) is obsolete, and so is the idea that you won't wake up tomorrow and discover you've received email in a script you'd never even heard of before --- and the computer is still expected to render it correctly and recognize it for machine translation.

    Because the data model no longer fits reality, neither do the interfaces built on it. My honest recommendation is that you forget you ever heard of locale.h or any ISO C or POSIX interface that deals in wchar_t. Instead, use a third-party library (e.g. ICU) whose data model is a better fit for the modern world.
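
    For instance, decoding externally supplied UTF-8 into ICU's internal UTF-16 representation looks roughly like this (a minimal sketch against ICU4C's C API; real code would size the buffer properly and do more error handling):

        #include <stdio.h>
        #include <unicode/ustring.h>   /* u_strFromUTF8, UChar */
        #include <unicode/utypes.h>    /* UErrorCode, U_FAILURE, u_errorName */

        int main(void)
        {
            const char *utf8 = "\xc3\xa9t\xc3\xa9";  /* "été" as explicit UTF-8 bytes */
            UChar buf[64];                           /* ICU's internal form is UTF-16 */
            int32_t len = 0;
            UErrorCode status = U_ZERO_ERROR;

            u_strFromUTF8(buf, 64, &len, utf8, -1, &status);
            if (U_FAILURE(status)) {
                fprintf(stderr, "conversion failed: %s\n", u_errorName(status));
                return 1;
            }
            printf("converted to %d UTF-16 code units\n", (int)len);
            return 0;
        }

    Build it against the icu-uc library (e.g. cc demo.c $(pkg-config --cflags --libs icu-uc)); none of this depends on the C locale machinery at all.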

    Types for characters and strings specifically encoded in UTF-n (n = 8, 16, 32) --- char16_t and char32_t from C11's <uchar.h>, and more recently char8_t --- have been added to the C standard, and in principle they should make this situation better, but I don't have any experience with them, and as far as I can tell the standard library barely takes notice of them.
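
    For what it's worth, they look like this (a minimal sketch using C11's <uchar.h>; note how little the standard library lets you do with the results):

        #include <stdio.h>
        #include <uchar.h>   /* char16_t, char32_t (C11) */

        int main(void)
        {
            /* These literals are UTF-8 / UTF-16 / UTF-32 by definition,
             * independent of the locale and of -fexec-charset. */
            const char     *s8  = u8"\u03c0 = pi";  /* UTF-8 (type char[] in C11) */
            const char16_t *s16 = u"\u03c0 = pi";   /* UTF-16 */
            const char32_t *s32 = U"\u03c0 = pi";   /* UTF-32 */

            printf("%s\n", s8);   /* the bytes are UTF-8 whatever the locale says */
            (void)s16; (void)s32; /* hardly any standard functions accept these   */
            return 0;
        }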

    (For more detail on the failings of the locale.h and/or wchar_t APIs and the present state of efforts to improve the C standard library, see https://thephd.dev/cuneicode-and-the-future-of-text-in-c.)