fgets isn't using the set up locale


Consider the following code:

#include <stdio.h>
#include <locale.h>

int main()
{
    char test[100];

    printf("WITHOUT LOCALE: á, é, í, ó, ú, ü, ñ, ¿, ¡\n");

    setlocale(LC_CTYPE, "Spanish");

    printf("WITH LOCALE: á, é, í, ó, ú, ü, ñ, ¿, ¡\n");

    fgets(test, 100, stdin);

    printf("WITH FGETS AND LOCALE: %s\n", test);
    return 0;
}

And the following input for fgets:

á, é, í, ó, ú, ü, ñ, ¿, ¡

I'd expect fgets to handle the special characters according to the locale that was set beforehand. However, this is the output:

WITHOUT LOCALE: ß, Ú, Ý, ¾, ·, ³, ±, ┐, í
WITH LOCALE: á, é, í, ó, ú, ü, ñ, ¿, ¡
WITH FGETS AND LOCALE:  , ', ¡, ¢, £, ?, ¤, ¨, ­

Any idea about what could be happening?


Solution

  • Because I keep running into questions like this one at work, I put together a side-by-side table of common 8-bit encodings.

    Using that table, it appears that:

    • your editor saved the source in CP-1252 (where e.g. 'ó' -> 0xf3),
    • the first output line shows that byte interpreted as (DOS) CP-850 (0xf3 -> '¾'),
    • the second line (after setlocale()) shows it interpreted as CP-1252 (0xf3 -> 'ó'),
    • the third line is input that arrived encoded in CP-850 and was then displayed as CP-1252 ('ó' -> 0xa2 -> '¢'). fgets() only copies the bytes it receives; it does not convert them to the locale's encoding.

    (I assumed a Windows platform, hence CP-1252, as non-Windows platforms would not come up with CP-850 unless forced to at gunpoint. The source encoding could also be ISO 8859-1 / Western European or ISO 8859-9 / Turkish; these are impossible to tell apart with the given characters. It could not be ISO 8859-15, as that would have turned 'ñ' into '€', not '¤'. It could not be any other ISO 8859 encoding either, as only -1, -9 and -15 turn '¿' into '┐'.)
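
One way to check this diagnosis on your own machine is to look at the raw bytes that fgets() hands you. The following is only a minimal sketch: `hex_dump` is a hypothetical helper name, not a standard function, and it assumes a caller-supplied output buffer. If the analysis above is right, an 'ó' typed at the console should show up as A2 (CP-850), while the 'ó' from the source literal is F3 (CP-1252).

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write the bytes of `s` as space-separated,
   uppercase hex into `out`. Returns how many bytes of `s` were dumped;
   stops early if `out` is too small to hold the next byte. */
size_t hex_dump(const char *s, char *out, size_t out_size)
{
    size_t used = 0;   /* characters written to out so far */
    size_t count = 0;  /* bytes of s consumed so far */

    if (out_size == 0)
        return 0;
    out[0] = '\0';

    for (; s[count] != '\0'; ++count) {
        /* a byte needs at most 3 characters (" XX") plus the terminator */
        if (used + 4 > out_size)
            break;
        /* go through unsigned char so bytes >= 0x80 are not sign-extended */
        used += (size_t)snprintf(out + used, out_size - used, "%s%02X",
                                 count ? " " : "",
                                 (unsigned)(unsigned char)s[count]);
    }
    return count;
}
```

Calling `hex_dump(test, buf, sizeof buf)` right after the `fgets` call and printing `buf` shows which code page the input arrived in, independent of how the console later renders it.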

    Note that the interpretation of non-ASCII-7 characters in C source code is implementation-defined, so you have to make sure that your editor, the terminal (if any), and the compiler agree on the encoding used. If at all possible, set your environment to use Unicode (UTF-8 being the most practical) throughout, to avoid exactly this kind of problem. I also recommend using octal escapes for anything non-ASCII-7 in your source, as you don't know what encoding settings others will use when feeding your source to their editors / compilers.
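
As an illustration of the octal-escape suggestion, here is a small sketch of the test string from the question, spelled out once as CP-1252 bytes and once as UTF-8 bytes. The names `SAMPLE_CP1252`, `SAMPLE_UTF8` and `sample_text` are made up for this example, and which variant displays correctly still depends on the encoding your terminal expects. Octal escapes end after at most three digits, so they cannot accidentally swallow the characters that follow them, which a hex escape can.

```c
/* "á, é, í, ó, ú, ü, ñ, ¿, ¡" as CP-1252 bytes (one byte per character).
   The source file itself stays pure ASCII, so the editor's and the
   compiler's encoding settings no longer matter. */
static const char SAMPLE_CP1252[] =
    "\341, \351, \355, \363, \372, \374, \361, \277, \241";

/* The same text as UTF-8 bytes (two bytes per character here). */
static const char SAMPLE_UTF8[] =
    "\303\241, \303\251, \303\255, \303\263, \303\272, "
    "\303\274, \303\261, \302\277, \302\241";

/* Hypothetical helper: pick the literal that matches the output encoding. */
const char *sample_text(int use_utf8)
{
    return use_utf8 ? SAMPLE_UTF8 : SAMPLE_CP1252;
}
```

Passing the result to `printf("%s\n", ...)` then sends exactly these bytes to the output stream, whatever encoding the source file was saved in.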