c++ windows ascii codepages

How to detect non-ascii characters in C++ on Windows?


I'm simply trying to detect non-ascii characters in my C++ program on Windows. Using something like isascii() or:

bool is_printable_ascii = (ch & ~0x7f) == 0 &&
                          (isprint(ch) || isspace(ch));

does not work because non-ascii characters are getting mapped to ascii characters before or while getchar() is doing its thing. For example, if I have some code like:

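#include <cstdio>    // getchar, printf
#include <ctype.h>   // isascii (a non-standard extension)
#include <iostream>
using namespace std;
int main()
{
    int c;
    c = getchar();
    cout << isascii(c) << endl;   // 1 if c is in the range 0..127
    cout << c << endl;            // numeric value of the character read
    printf("0x%x\n", c);          // the same value in hex
    cout << (char)c;              // the character itself
    return 0;
}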

and input a 😁 (because I am so happy right now), the output is

1
63
0x3f
?

Furthermore, if I feed the program something outside the extended ascii range (codepage 437), like 'Ĥ', the output is

1
72
0x48
H

The same happens with similar inputs such as Ĭ or ō (they become I and o). So this seems algorithmic and not just mojibake or something. A quick check in Python (via the same terminal) with a program like

i = input()
print(ord(i))

gives me the expected actual code point instead of the ascii-mapped one (so it's not the codepage or the terminal (?)). This makes me believe getchar() or the C++ compilers (tested with the VS compiler and g++) are doing something funky. I have also tried using cin and many other alternatives. Note that I've tried this on Linux and cannot reproduce the issue there, which makes me inclined to believe that it has something to do with Windows (10 Pro). Can anyone explain what is going on here?
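
For reference, the check described at the top of the question can be written out as a complete function. This is only a minimal sketch; the name is_printable_ascii is illustrative. It can only classify the value it is handed, so if the input has already been turned into '?' or 'H' before the program sees it, the function will report ASCII, which matches the outputs shown above.

```cpp
#include <cctype>

// Returns true when ch (a value as returned by getchar()) is a 7-bit ASCII
// character that is either printable or whitespace.
// EOF (-1) and anything above 0x7f have bits set outside the low 7 bits, so
// they are rejected before the <cctype> functions are ever called.
bool is_printable_ascii(int ch)
{
    return (ch & ~0x7f) == 0 &&
           (std::isprint(ch) || std::isspace(ch));
}
```

With this helper, is_printable_ascii('H') and is_printable_ascii('?') are both true, while a value such as 0x124 (the code point of Ĥ) is rejected.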


Solution

  • Okay, I have solved this. I was not aware of translation modes.

    _setmode(_fileno(stdin), _O_WTEXT);
    

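    That was the solution. The link below explains translation phases, and I think phase 5 (character-set mapping) explains what happened. (Those phases describe how the compiler processes source text; the translation mode set by _setmode is a separate, run-time property of the stream.) https://en.cppreference.com/w/cpp/language/translation_phases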
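
To make the fix concrete, here is a rough sketch of how the _setmode call might be combined with wide-character input. The helper names (enable_wide_stdin, is_non_ascii, report_next_char) are made up for illustration, and the sketch assumes that once stdin is in _O_WTEXT mode it is read with wide functions such as getwchar() rather than getchar(). The _setmode call is Windows-specific, so it is guarded with _WIN32 and skipped elsewhere.

```cpp
#include <cstdio>
#include <cwchar>
#ifdef _WIN32
#include <fcntl.h>  // _O_WTEXT
#include <io.h>     // _setmode
#endif

// Hypothetical helper: put stdin into wide-text mode (Windows only).
// On other platforms this does nothing.
void enable_wide_stdin()
{
#ifdef _WIN32
    _setmode(_fileno(stdin), _O_WTEXT);
#endif
}

// True for any wide code unit outside the 7-bit ASCII range.
// WEOF is treated as "not a character" and therefore returns false.
bool is_non_ascii(std::wint_t ch)
{
    return ch != WEOF && ch > 0x7fu;
}

// Hypothetical helper: reads one wide code unit from stdin and prints
// whether it is non-ASCII, followed by its value in hex.
// Returns false once the end of input (or a read error) is reached.
bool report_next_char()
{
    const std::wint_t ch = std::getwchar();
    if (ch == WEOF)
        return false;
    std::printf("non-ascii: %d, value: 0x%x\n",
                is_non_ascii(ch) ? 1 : 0, static_cast<unsigned>(ch));
    return true;
}
```

A program would call enable_wide_stdin() once at the start of main() and then report_next_char() for each character it wants to inspect. Note that wchar_t is 16 bits wide on Windows, so a character outside the Basic Multilingual Plane, such as 😁, should arrive as two UTF-16 code units (a surrogate pair), both of which are reported as non-ASCII.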