I want to know how to force a GCC + GNU libc toolchain into normal Unicode behaviour, where the source files are encoded in UTF-8 and the compiled program uses UTF-8 as its multibyte character set and UTF-32LE as its wchar_t encoding, regardless of any locale info.
And I want to be able to know at compile time that it is going to work.
I know the normal answer is to use setlocale(LC_ALL, "en_US.utf8"). But it seems you can only know at runtime whether setlocale(LC_ALL, "en_US.utf-8") is going to work, since only the "C" and "POSIX" locales are guaranteed to exist and, unless I'm missing something, you can't compile a locale into your executable.
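To make the runtime nature of that check concrete, here is a minimal sketch that probes a few locale names and verifies the resulting codeset with nl_langinfo(CODESET). The function name select_utf8_locale and the candidate list are hypothetical choices for illustration; none of these names is guaranteed to be installed on a given system.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: try some candidate locale names and return the
   first one that both exists and reports UTF-8 as its codeset.
   Returns NULL (and falls back to the "C" locale) if none qualifies. */
static const char *select_utf8_locale(void)
{
    /* "" means "whatever the environment asks for"; the other two names
       are common on glibc systems but are assumptions, not guarantees. */
    static const char *const candidates[] = { "", "C.UTF-8", "en_US.UTF-8" };
    size_t n = sizeof candidates / sizeof candidates[0];

    for (size_t i = 0; i < n; i++) {
        if (setlocale(LC_ALL, candidates[i]) != NULL
            && strcmp(nl_langinfo(CODESET), "UTF-8") == 0)
            return candidates[i];
    }

    setlocale(LC_ALL, "C");   /* the only locale certain to exist */
    return NULL;
}
```

A caller would invoke select_utf8_locale() once at startup and treat a NULL result as "no UTF-8 locale available here", which is exactly the outcome that cannot be ruled out at compile time.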
GCC has these flags: -finput-charset=utf-8 -fexec-charset=utf-8 -fwide-exec-charset=utf-32le, but it is unclear how they work with setlocale(). If I use them, do I still need to call setlocale()? Are they overridden by setlocale()?
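As far as the GCC documentation goes, -fexec-charset and -fwide-exec-charset only decide how string and character literals are encoded when the program is compiled, while library functions such as mbstowcs() consult the locale selected at runtime. The sketch below checks the compile-time half only. The helper names are hypothetical, and it assumes GCC's default execution character sets (UTF-8 and UTF-32) on a glibc target.

```c
#include <stddef.h>
#include <wchar.h>

/* U+00E9 is written as a universal character name, so the encoding of
   this source file (and -finput-charset) plays no part in the result. */
static const char    narrow_e_acute[] = "\u00e9";
static const wchar_t wide_e_acute[]   = L"\u00e9";

/* Returns 1 if the narrow literal was compiled to the UTF-8 bytes C3 A9
   (two bytes plus the terminating NUL). */
static int narrow_literal_is_utf8(void)
{
    return sizeof narrow_e_acute == 3
        && (unsigned char)narrow_e_acute[0] == 0xC3
        && (unsigned char)narrow_e_acute[1] == 0xA9;
}

/* Returns 1 if wchar_t is 32 bits wide and the wide literal stores the
   code point U+00E9 directly, as UTF-32 would. */
static int wide_literal_is_utf32(void)
{
    return sizeof(wchar_t) == 4 && wide_e_acute[0] == 0xE9;
}
```

Both checks look only at the bytes the compiler emitted. They say nothing about what mbstowcs(), printf("%ls") and similar functions will do, because those still depend on the locale in effect when they are called.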
It seems like there should be some reliable way to force gcc + libc into normal Unicode behaviour without having to know what locales are preinstalled on the source or target systems.
This is not possible, and you don't want it anyway.
The interfaces defined by locale.h and wchar.h are a decade older than Unicode, and their data model is built around these assumptions:
All three of these assumptions are invalid nowadays. Instead we have:
Because the data model is no good anymore, so too are the interfaces. My honest recommendation is that you forget you ever heard of locale.h or any ISO C or POSIX interface that deals in wchar_t. Instead use a third-party library (e.g. ICU) whose data model is a better fit for the modern world.
Types for characters and strings specifically encoded in UTF-n (n=8, 16, 32) have recently been added to the C standard, and in principle they should make this situation better, but I don't have any experience with them, and as far as I can tell the standard library barely takes notice of them.
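For illustration only, here is a small sketch of what those types allow. It relies on C11's u8"" and U"" literal prefixes and on the __STDC_UTF_32__ macro from <uchar.h> territory, which together give a compile-time statement about literal encodings that does not depend on the locale. The names utf8_euro, utf32_euro and utf8_count_code_points are hypothetical, and the counting function assumes its input is already valid UTF-8.

```c
#include <stddef.h>
#include <uchar.h>

/* If this macro is missing, char32_t values are not promised to be UTF-32,
   so refuse to build rather than find out at runtime. */
#ifndef __STDC_UTF_32__
#error "char32_t is not guaranteed to hold UTF-32 on this implementation"
#endif

_Static_assert(sizeof(char32_t) * 8 >= 32, "char32_t must hold 32 bits");

/* U+20AC EURO SIGN. u8"" literals are UTF-8 and U"" literals are UTF-32
   by definition. The cast keeps this valid in both C11 (char elements)
   and C23 (char8_t elements). */
static const unsigned char *utf8_euro(void)
{
    return (const unsigned char *)u8"\u20ac";
}

static const char32_t utf32_euro[] = U"\u20ac";

/* Count code points in a NUL-terminated string, assuming valid UTF-8:
   every byte that is not a continuation byte (10xxxxxx) starts one. */
static size_t utf8_count_code_points(const unsigned char *s)
{
    size_t count = 0;
    for (; *s != 0; s++) {
        if ((*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

Nothing here calls setlocale(), which is the appeal. The catch is that the standard library offers very little that consumes these types, so anything beyond simple scanning like this still ends up in hand-written code or a third-party library.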
(For more detail on the failings of the locale.h and/or wchar_t APIs and the present state of efforts to improve the C standard library, see https://thephd.dev/cuneicode-and-the-future-of-text-in-c.)