Tags: c, windows, locale, codeblocks, stdio

Read / write special characters (accents, ñ, ...) in a C console application


I'm trying to get a C console application to read special Spanish characters (accented letters, 'ñ', etc.) from the keyboard with scanf or gets, and then print them with printf.

I have managed to display these characters correctly (whether stored in a variable or written directly in the printf call) by using locale.h. Here is an example:

#include <stdio.h>
// Add locale support
#include <locale.h>

int main(void)
{
    char string[254];

    // Set the locale to Spanish
    setlocale(LC_ALL, "spanish");

    // Spanish special characters are displayed correctly
    printf("¡Success!. It is shown special chars like 'ñ' or 'á'.\n\n\n");

    // Read special characters from the keyboard
    printf("Input spanish special chars (such 'ñ'): ");
    // fgets replaces gets, which cannot limit the input length and was removed in C11
    if (fgets(string, sizeof string, stdin) == NULL)
        return 1;

    printf("Your string is: %s", string);

    return 0;
}

but I have not yet managed to read them correctly with the input functions mentioned above.

Does anyone know how to do it?

Thank you.


EDIT 1:

In testing, I observed that:

  • setlocale(LC_ALL, "spanish"); displays Spanish characters correctly, but does not read them correctly from the keyboard.
  • setlocale(LC_ALL, "es_ES"); reads Spanish characters correctly from the keyboard, but does not display them correctly.


EDIT 2:

I have also tried setlocale(LC_ALL, "");, setlocale(LC_ALL, "es_ES.UTF-8"); and setlocale(LC_ALL, "es_ES.ISO_8859-15"); with the same results as in EDIT 1: either the characters are read correctly from the keyboard or they are displayed correctly in the console, but never both at the same time.


Solution

  • Microsoft's C runtime library (CRT) has traditionally not supported UTF-8 as the locale encoding; it only supported Windows codepages. (The Universal CRT accepts a UTF-8 codepage starting with Windows 10 version 1803, but that does not help with older runtimes.) Also, "es_ES" isn't a valid CRT locale string, so setlocale would fail, leaving you in the default C locale. Newer versions of Microsoft's CRT support Windows locale names such as "es-ES" (hyphen, not underscore). Otherwise the CRT uses the full names or the old 3-letter abbreviations, e.g. "spanish_spain", "esp_esp" or "esp_esp.1252".
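Because setlocale returns NULL when it rejects a name, it can help to check which name the runtime actually accepts instead of assuming the call worked. The following is a minimal sketch; first_accepted_locale is a hypothetical helper written for illustration, not part of the CRT or any library.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: try each candidate locale name in order and return
   the first one that setlocale accepts, or NULL if every name is rejected.
   The accepted locale stays active when the function returns. */
const char *first_accepted_locale(const char *const *names, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        /* setlocale returns NULL if the name is not recognized. */
        if (setlocale(LC_ALL, names[i]) != NULL) {
            return names[i];
        }
    }
    return NULL;
}
```

Called with the candidates "es-ES", "spanish_spain" and "spanish", it would report the first name this particular runtime understands. A NULL result means none was accepted and the locale is unchanged (the "C" locale at program start).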

    But that's not the end of the story. When reading from and writing to the console using legacy text encodings instead of Unicode, there's another layer of translation in the console itself. To avoid mojibake, you have to set the console input and output codepages (i.e. SetConsoleCP and SetConsoleOutputCP) to match the locale codepage. If you're limited to Spanish or Latin-1, then it should work to set the locale to "spanish" and set the console codepages via SetConsoleCP(1252) and SetConsoleOutputCP(1252). More generally you could look up the ANSI codepage for a given locale name, set the console codepages, and save them in order to reset the console at exit. For example:

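    #include <windows.h>
    #include <locale.h>
    #include <stdlib.h>

    // Assumed definitions: the code below relies on these globals and on
    // reset_console, so minimal versions are sketched here.
    static UINT gPrevConsoleCP;
    static UINT gPrevConsoleOutputCP;

    // Restore the console codepages that were active at startup.
    static void reset_console(void)
    {
        SetConsoleCP(gPrevConsoleCP);
        SetConsoleOutputCP(gPrevConsoleOutputCP);
    }

    // In main(), before any console I/O:
    const wchar_t *locale_name = L"es-ES";
    if (_wsetlocale(LC_ALL, locale_name)) {
        int codepage = 0;
        gPrevConsoleCP = GetConsoleCP();
        if (gPrevConsoleCP) { // The process is attached to a console.
            gPrevConsoleOutputCP = GetConsoleOutputCP();
            if (GetLocaleInfoEx(locale_name, 
                                LOCALE_IDEFAULTANSICODEPAGE | 
                                LOCALE_RETURN_NUMBER, 
                                (LPWSTR)&codepage, 
                                sizeof(codepage) / sizeof(wchar_t))) {
                if (!codepage) { // The locale doesn't have an ANSI codepage.
                    codepage = GetACP();
                }
                SetConsoleCP(codepage);
                SetConsoleOutputCP(codepage);
                atexit(reset_console);
            }
        }
    }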
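When it is unclear which layer is mangling the text, it can help to look at the raw bytes the program actually received from the console. The sketch below uses a hypothetical helper, dump_bytes_hex, written for illustration. For reference, 'ñ' is the single byte F1 in codepage 1252, A4 in the OEM codepages 437 and 850, and the two bytes C3 B1 in UTF-8.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: write the bytes of s into out as space-separated
   two-digit hex values. Stops early if out is too small.
   Returns the number of bytes of s that were dumped. */
size_t dump_bytes_hex(const char *s, char *out, size_t out_size)
{
    size_t n = 0;     /* bytes of s consumed so far */
    size_t used = 0;  /* characters written to out so far */

    if (out_size == 0) {
        return 0;
    }
    out[0] = '\0';

    /* Each byte needs at most a separator, two hex digits and the
       terminating NUL, i.e. four characters of room. */
    for (; s[n] != '\0' && used + 4 <= out_size; ++n) {
        used += (size_t)snprintf(out + used, out_size - used, "%s%02X",
                                 n ? " " : "", (unsigned char)s[n]);
    }
    return n;
}
```

Printing the dump of a line typed as "ñ" shows which encoding the console delivered. If the byte does not match the codepage of the active locale, the console codepages and the locale are out of step.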
    

    That said, when working with the console you will be better off in general if you set stdin and stdout to use _O_U16TEXT mode and use wide-character functions such as fgetws and wprintf. Ultimately, if supported by the C runtime library, this should use the wide-character console I/O functions ReadConsoleW and WriteConsoleW. The downside of using UTF-16 wide-character mode is that it would entail a complete rewrite of your code to use wchar_t strings and wide-character functions and also would require implementing adapters for libraries that work with multibyte encoded strings (preferably UTF-8).
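The wide-character approach could look roughly like the sketch below. The helper names enable_wide_console and read_wide_line are hypothetical, chosen for illustration. The Windows branch uses _setmode with _O_U16TEXT as described above; the non-Windows branch is an assumption added only so that the sketch compiles elsewhere.

```c
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#ifdef _WIN32
#include <io.h>
#include <fcntl.h>
#endif

/* Hypothetical helper: switch stdin and stdout to wide-character I/O.
   Returns 0 on success, -1 on failure. */
int enable_wide_console(void)
{
#ifdef _WIN32
    /* _O_U16TEXT puts the stream into UTF-16 text mode. */
    if (_setmode(_fileno(stdin), _O_U16TEXT) == -1) return -1;
    if (_setmode(_fileno(stdout), _O_U16TEXT) == -1) return -1;
    return 0;
#else
    /* Assumption: on other systems the environment's locale is enough. */
    return setlocale(LC_ALL, "") ? 0 : -1;
#endif
}

/* Hypothetical helper: read one line with fgetws and drop the trailing
   newline. Returns buf, or NULL on end of input or error. */
wchar_t *read_wide_line(wchar_t *buf, int size, FILE *in)
{
    if (fgetws(buf, size, in) == NULL) {
        return NULL;
    }
    buf[wcscspn(buf, L"\n")] = L'\0';
    return buf;
}
```

A program would call enable_wide_console() once at startup, then read with read_wide_line(buf, size, stdin) and print with wprintf(L"%ls\n", buf). After the switch, the same streams should only be used with wide-character functions; as far as I know, Microsoft's CRT does not support narrow functions such as printf on a stream in this mode.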