How to check if a locale is UTF-8?


I'm working with Yocto to create an embedded Linux distribution for an ARM device (i.MX 6Quad processor).

I've configured the list of desired locales with the variable:

IMAGE_LINGUAS = "de-de fr-fr en-gb en-gb.iso-8859-1 en-us en-us.iso-8859-1 zh-cn"

As a result I've obtained a file system that contains the following folders:

root@lam_icu:/usr/lib/locale# cd /usr/share/locale/
root@lam_icu:/usr/share/locale# ls -la
total 0
drwxr-xr-x  6 root root  416 Nov 17  2016 .
drwxr-xr-x 30 root root 2056 Nov 17  2016 ..
drwxr-xr-x  4 root root  296 Nov 17  2016 de
drwxr-xr-x  3 root root  232 Nov 17  2016 en_GB
drwxr-xr-x  4 root root  296 Nov 17  2016 fr
drwxr-xr-x  4 root root  296 Nov 17  2016 zh_CN

and:

root@lam_icu:/usr/share/locale# cd /usr/lib/locale/
root@lam_icu:/usr/lib/locale# ls -la
total 0
drwxr-xr-x  9 root root   640 Mar 13  2017 .
drwxr-xr-x 32 root root 40000 Mar 13  2017 ..
drwxr-xr-x  3 root root  1016 Mar 13  2017 de_DE
drwxr-xr-x  3 root root  1016 Mar 13  2017 en_GB
drwxr-xr-x  3 root root  1016 Mar 13  2017 en_GB.ISO-8859-1
drwxr-xr-x  3 root root  1016 Mar 13  2017 en_US
drwxr-xr-x  3 root root  1016 Mar 13  2017 en_US.ISO-8859-1
drwxr-xr-x  3 root root  1016 Mar 13  2017 fr_FR
drwxr-xr-x  3 root root  1016 Mar 13  2017 zh_CN

What is the encoding of all the non-ISO-8859-1 locales? Can I assume that "en_GB" or "en_US" use the UTF-8 encoding?

I've tried to open the "LC_IDENTIFICATION" file; the result is:

Hc�������������cEnglish locale for the USAFree Software Foundation, Inc.http://www.gnu.org/software/libc/[email protected]_US:2000en_US:2000en_US:2000en_US:2000en_US:2000en_US:2000en_US:2000en_US:2000en_US:2000en_US:2000en_US:2000en_US:2000UTF-8

At the end of the file there is something that looks like "UTF-8". Is this enough to assume that the encoding is UTF-8?

How to check if a locale is UTF-8?


Solution

  • LC_IDENTIFICATION doesn't tell you much:

    LC_IDENTIFICATION - this is not a user-visible category, it contains information about the locale itself and is rarely useful for users or developers (but is listed here for completeness sake).

    You'd have to look at the complete set of files.

    There appears to be no standard command-line utility for doing this, but there is a runtime call (added a little later than the original locale functions). Here is a sample program which illustrates the function nl_langinfo:

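    #include <stdio.h>
    #include <locale.h>
    #include <langinfo.h>

    /* Print the codeset of each locale named on the command line. */
    int
    main(int argc, char **argv)
    {
        int n;
        for (n = 1; n < argc; ++n) {
            /* setlocale returns NULL if the locale cannot be selected */
            if (setlocale(LC_ALL, argv[n]) != NULL) {
                /* CODESET names the character encoding of the current locale */
                char *code = nl_langinfo(CODESET);
                if (code != NULL)
                    printf("%s ->%s\n", argv[n], code);
                else
                    printf("? %s (nl_langinfo)\n", argv[n]);
            } else {
                printf("? %s (setlocale)\n", argv[n]);
            }
        }
        return 0;
    }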
    

    and some sample output, e.g., from running foo $(locale -a), where foo is the compiled program:

    aa_DJ ->ISO-8859-1
    aa_DJ.iso88591 ->ISO-8859-1
    aa_DJ.utf8 ->UTF-8
    aa_ER ->UTF-8
    aa_ER@saaho ->UTF-8
    aa_ER.utf8 ->UTF-8
    aa_ER.utf8@saaho ->UTF-8
    aa_ET ->UTF-8
    aa_ET.utf8 ->UTF-8
    af_ZA ->ISO-8859-1
    af_ZA.iso88591 ->ISO-8859-1
    af_ZA.utf8 ->UTF-8
    am_ET ->UTF-8
    am_ET.utf8 ->UTF-8
    an_ES ->ISO-8859-15
    an_ES.iso885915 ->ISO-8859-15
    an_ES.utf8 ->UTF-8
    ar_AE ->ISO-8859-6
    ar_AE.iso88596 ->ISO-8859-6
    ar_AE.utf8 ->UTF-8
    ar_BH ->ISO-8859-6
    ar_BH.iso88596 ->ISO-8859-6
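
To answer the title question directly, the same nl_langinfo(CODESET) call can be wrapped in a yes/no check. The following is only a minimal sketch: the helper names codeset_is_utf8 and locale_is_utf8 are hypothetical, and ignoring case and the "-"/"_" separators when comparing is an assumption made here for portability (the output above shows glibc reporting "UTF-8").

```c
#include <ctype.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the codeset name spells UTF-8
 * ("UTF-8", "utf8", ...), otherwise 0. Case and the separators
 * '-' and '_' are ignored; that leniency is an assumption. */
static int codeset_is_utf8(const char *codeset)
{
    static const char want[] = "UTF8";
    size_t i = 0;

    if (codeset == NULL)
        return 0;
    for (; *codeset != '\0'; ++codeset) {
        if (*codeset == '-' || *codeset == '_')
            continue;   /* skip separators */
        if (want[i] == '\0' || toupper((unsigned char)*codeset) != want[i])
            return 0;   /* too long, or a character differs */
        ++i;
    }
    return want[i] == '\0';   /* all of "UTF8" must have been matched */
}

/* Hypothetical helper: returns 1 if the locale `name` uses UTF-8,
 * 0 if it uses another codeset, -1 if setlocale cannot select it.
 * Note that this changes the LC_CTYPE locale of the calling process. */
static int locale_is_utf8(const char *name)
{
    if (setlocale(LC_CTYPE, name) == NULL)
        return -1;
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

On the target from the question, calling locale_is_utf8("en_GB") or locale_is_utf8("en_US") would report whether those installed locales are UTF-8; the result depends entirely on how the locales were generated for the image.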
    

    The directory names you're referring to are often (but are not required to be) the same as encoding names. That is the assumption made in the example program. There was a related question, How to get terminal's Character Encoding, but it has no useful answers. One is interesting though, since it asserts that

    locale charmap
    

    will give the locale encoding. According to the standard, that's not necessarily so:

    • The command locale charmap gives the name used in localedef -f

    • However, localedef attaches no special meaning to the name given in the -f option.

    • localedef has a different option -u which identifies the codeset, but locale (in the standard) mentions no method for displaying this information.

    As usual, implementations may (or may not) treat unspecified features in different ways. The GNU C library's documentation differs in some respects from the standard (see locale and localedef), but offers no explicit options for showing the codeset name.