I'm working with Yocto to create an embedded Linux distribution for an ARM device (i.MX 6Quad Processors).
I've configured the list of desired locales with the variable:
IMAGE_LINGUAS = "de-de fr-fr en-gb en-gb.iso-8859-1 en-us en-us.iso-8859-1 zh-cn"
As a result, I obtained a file system that contains the following folders:
root@lam_icu:/usr/lib/locale# cd /usr/share/locale/
root@lam_icu:/usr/share/locale# ls -la
total 0
drwxr-xr-x 6 root root 416 Nov 17 2016 .
drwxr-xr-x 30 root root 2056 Nov 17 2016 ..
drwxr-xr-x 4 root root 296 Nov 17 2016 de
drwxr-xr-x 3 root root 232 Nov 17 2016 en_GB
drwxr-xr-x 4 root root 296 Nov 17 2016 fr
drwxr-xr-x 4 root root 296 Nov 17 2016 zh_CN
and:
root@lam_icu:/usr/share/locale# cd /usr/lib/locale/
root@lam_icu:/usr/lib/locale# ls -la
total 0
drwxr-xr-x 9 root root 640 Mar 13 2017 .
drwxr-xr-x 32 root root 40000 Mar 13 2017 ..
drwxr-xr-x 3 root root 1016 Mar 13 2017 de_DE
drwxr-xr-x 3 root root 1016 Mar 13 2017 en_GB
drwxr-xr-x 3 root root 1016 Mar 13 2017 en_GB.ISO-8859-1
drwxr-xr-x 3 root root 1016 Mar 13 2017 en_US
drwxr-xr-x 3 root root 1016 Mar 13 2017 en_US.ISO-8859-1
drwxr-xr-x 3 root root 1016 Mar 13 2017 fr_FR
drwxr-xr-x 3 root root 1016 Mar 13 2017 zh_CN
What is the encoding of the locales that are not marked ISO-8859-1? Can I assume that "en_GB" and "en_US" use the UTF-8 encoding?
I tried opening the "LC_IDENTIFICATION" file; the result is:
Hc�������������cEnglish locale for the USAFree Software Foundation, Inc.http://www.gnu.org/software/libc/[email protected]_US:2000en_US:2000en_US:2000en_US:2000en_US:2000en_US:2000en_US:2000en_US:2000en_US:2000en_US:2000en_US:2000en_US:2000UTF-8
At the end of the file there is something that looks like "UTF-8". Is this enough to assume that the encoding is UTF-8?
How to check if a locale is UTF-8?
LC_IDENTIFICATION doesn't tell you much:
LC_IDENTIFICATION - this is not a user-visible category, it contains information about the locale itself and is rarely useful for users or developers (but is listed here for completeness sake).
You'd have to look at the complete set of files.
There appears to be no standard command-line utility for doing this, but there is a runtime call (added a little later than the original locale functions). Here is a sample program which illustrates the function nl_langinfo:
#include <stdio.h>
#include <locale.h>
#include <langinfo.h>

/* Print the codeset of each locale named on the command line. */
int
main(int argc, char **argv)
{
    int n;

    for (n = 1; n < argc; ++n) {
        /* setlocale returns NULL if the locale is not available */
        if (setlocale(LC_ALL, argv[n]) != NULL) {
            char *code = nl_langinfo(CODESET);
            if (code != NULL)
                printf("%s ->%s\n", argv[n], code);
            else
                printf("?%s (nl_langinfo)\n", argv[n]);
        } else {
            printf("? %s (setlocale)\n", argv[n]);
        }
    }
    return 0;
}
and some output, e.g., from running it as foo $(locale -a), where foo is the compiled program:
aa_DJ ->ISO-8859-1
aa_DJ.iso88591 ->ISO-8859-1
aa_DJ.utf8 ->UTF-8
aa_ER ->UTF-8
aa_ER@saaho ->UTF-8
aa_ER.utf8 ->UTF-8
aa_ER.utf8@saaho ->UTF-8
aa_ET ->UTF-8
aa_ET.utf8 ->UTF-8
af_ZA ->ISO-8859-1
af_ZA.iso88591 ->ISO-8859-1
af_ZA.utf8 ->UTF-8
am_ET ->UTF-8
am_ET.utf8 ->UTF-8
an_ES ->ISO-8859-15
an_ES.iso885915 ->ISO-8859-15
an_ES.utf8 ->UTF-8
ar_AE ->ISO-8859-6
ar_AE.iso88596 ->ISO-8859-6
ar_AE.utf8 ->UTF-8
ar_BH ->ISO-8859-6
ar_BH.iso88596 ->ISO-8859-6
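To get a direct yes/no answer for a single locale, the same two calls can be wrapped in a small predicate. This is only a minimal sketch, assuming a POSIX system; the function names codeset_is_utf8 and locale_is_utf8 are hypothetical, and the set of accepted spellings ("UTF-8", "utf8", and case variants) is an assumption about what an implementation might report.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <strings.h>

/* Return 1 if a codeset name means UTF-8, accepting spellings such as
 * "UTF-8", "utf-8" or "utf8"; return 0 otherwise (including NULL). */
int codeset_is_utf8(const char *codeset)
{
    if (codeset == NULL)
        return 0;
    return strcasecmp(codeset, "UTF-8") == 0
        || strcasecmp(codeset, "UTF8") == 0;
}

/* Return 1 if the named locale uses UTF-8, 0 if it uses another codeset,
 * and -1 if setlocale() does not accept the name.
 * Note: this changes the process-wide locale as a side effect. */
int locale_is_utf8(const char *name)
{
    assert(name != NULL);
    if (setlocale(LC_ALL, name) == NULL)
        return -1;
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

On the target image, calling locale_is_utf8("en_US") and locale_is_utf8("en_US.ISO-8859-1") from a small main would show whether the unsuffixed locale was built with UTF-8; the result depends entirely on the locale data that is installed.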
The directory names you're referring to are often (but are not required to be) the same as encoding names. That is the assumption made in the example program. There was a related question, How to get terminal's Character Encoding, but it has no useful answers. One is interesting, though, since it asserts that locale charmap will give the locale encoding. According to the standard, that's not necessarily so:
The command locale charmap gives the name used in localedef -f.
However, localedef attaches no special meaning to the name given in the -f option.
localedef has a different option, -u, which identifies the codeset, but locale (in the standard) mentions no method for displaying this information.
As usual, implementations may (or may not) treat unspecified features in different ways. The GNU C library's documentation differs in some respects from the standard (see locale and localedef), but offers no explicit options for showing the codeset name.
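Since neither utility is required to print the codeset, one option is to ask the C library directly for the locale that the environment selects. The following is a minimal sketch, assuming a POSIX system; the function name environment_codeset is hypothetical.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/* Adopt the locale selected by the environment (LC_ALL, LC_CTYPE, LANG)
 * and return its codeset name. If the environment names a locale that is
 * not installed, setlocale() fails and the previously active locale
 * (initially "C") stays in effect. */
const char *environment_codeset(void)
{
    if (setlocale(LC_ALL, "") == NULL)
        fprintf(stderr, "warning: environment locale is not available\n");
    return nl_langinfo(CODESET);
}
```

Printing the returned string from main, with LANG or LC_ALL set to one of the installed locale names, reports the codeset the C library will actually use for that setting.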