mbrtowc() fails to convert when I'm using newlocale() and uselocale() instead of setlocale()


I would like to convert a UTF-8 string to its wide-character representation so that I can work with the code points. To do this, I call uselocale() on Linux to change the locale for the current thread only. But for some reason it doesn't do what I expect. Here is a minimal program:

#define _XOPEN_SOURCE 700
#include <locale.h>
#include <wchar.h>
#include <stdio.h>
#include <assert.h>

int main()
{
    locale_t loc = newlocale(LC_ALL, "en_US.UTF-8", (locale_t)0);
    assert(loc);
    locale_t prevLocale = uselocale(loc);
    assert(prevLocale);

    wchar_t res;
    char src[] = "á";
    mbstate_t mbs = {0};
    int v = (int)mbrtowc(&res, src, sizeof(src), &mbs);
    printf("%d\n", v);
    perror("Failed to convert char");

    return 0;
}

I expect it to pick up the UTF-8 locale and convert the character, but instead when I run this, I get:

-1
Failed to convert char: Invalid or incomplete multibyte or wide character

The source file is encoded as UTF-8. So that's not a problem.

If I call the process-wide setlocale instead, like this:

#define _XOPEN_SOURCE 700
#include <locale.h>
#include <wchar.h>
#include <stdio.h>
#include <assert.h>

int main()
{
    setlocale(LC_ALL, "en_US.UTF-8");

    wchar_t res;
    char src[] = "á";
    mbstate_t mbs = {0};
    int v = (int)mbrtowc(&res, src, sizeof(src), &mbs);
    printf("%d\n", v);
    perror("Failed to convert char");

    return 0;
}

The conversion succeeds:

2
Failed to convert char: Success

I want to set the locale only for the thread to avoid interference with the process-wide setting, then later I would restore it to the original one.

I've found that uselocale() overrides the process-wide locale: after calling uselocale(), setlocale() has no effect on the thread while the thread-level locale is in use. So uselocale() does have some effect, but the resulting locale seems to behave like the "C" locale.

What am I doing wrong here?


Solution

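  • newlocale() takes a category mask (an LC_*_MASK value such as LC_ALL_MASK) as its first argument, not a category constant such as LC_ALL. See the man page. On glibc, LC_ALL has the value 6, which read as a mask selects only LC_NUMERIC and LC_TIME, so LC_CTYPE stays in the "C" locale and mbrtowc() rejects the non-ASCII bytes.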

    locale_t loc = newlocale(LC_ALL_MASK, "en_US.UTF-8", (locale_t)0);
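For completeness, here is a minimal sketch of the whole flow the question describes: create the locale with the mask, switch the current thread to it, convert, then restore the previous locale and free the new one. The helper name decode_first_utf8 and its return codes are hypothetical, chosen only for this illustration, and the fallback to "C.UTF-8" is an assumption for systems where "en_US.UTF-8" is not installed.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not part of any library): decodes the first multibyte
 * character of `s` under a thread-local UTF-8 locale.
 * Returns the number of bytes consumed (0 for an empty string),
 * -1 for an invalid sequence, -2 for an incomplete sequence,
 * or -3 if no UTF-8 locale could be created. */
long decode_first_utf8(const char *s, wchar_t *out)
{
    assert(s != NULL);

    /* The first argument is a category MASK (LC_ALL_MASK), not LC_ALL. */
    locale_t loc = newlocale(LC_ALL_MASK, "en_US.UTF-8", (locale_t)0);
    if (loc == (locale_t)0) {
        /* Assumed fallback; available on many, but not all, systems. */
        loc = newlocale(LC_ALL_MASK, "C.UTF-8", (locale_t)0);
    }
    if (loc == (locale_t)0) {
        return -3;
    }

    /* Switch the locale for the calling thread only. */
    locale_t prev = uselocale(loc);

    mbstate_t mbs;
    memset(&mbs, 0, sizeof mbs);
    size_t n = mbrtowc(out, s, strlen(s) + 1, &mbs);

    /* Restore the previous thread locale before freeing the new one. */
    uselocale(prev);
    freelocale(loc);

    if (n == (size_t)-1) return -1;
    if (n == (size_t)-2) return -2;
    return (long)n;
}
```

With a UTF-8 encoded source file, calling decode_first_utf8("á", &res) should return 2, matching the setlocale() version in the question, while leaving the process-wide locale untouched.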