
How to get Windows-1252 character values in c++?


I have a weird input file with all kinds of control characters like nulls. I want to remove all control characters from this Windows-1252 encoded text file, but if you do this:

std::string test="tést";
for (int i=0;i<test.length();i++)
{
     if (test[i]<32) test[i]=32; // change all control characters into spaces
}

It will change the é into a space as well.

So if you have a string like this, encoded in Windows-1252:

std::string test="tést";

The hex values would be:

t  é  s  t
74 E9 73 74

See https://en.wikipedia.org/wiki/ASCII and https://en.wikipedia.org/wiki/Windows-1252

test[0] equals decimal 116 (=0x74), but apparently test[1], the é (0xE9), does not equal the decimal value 233.

So how can you recognize that é properly?


Solution

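  • The literal 32 is a signed int, and on this platform char is signed too, so the byte E9 is stored as -23. For the comparison the char is promoted to int, and -23 < 32 evaluates to true.

    Using the unsigned literal 32u makes the comparison unsigned: the promoted value -23 is converted to unsigned int, where it becomes a very large number (4294967273 with a 32-bit unsigned int), so the comparison with 32 evaluates to false.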

    Replace:

    if (test[i]<32) test[i]=32;
    

    By:

    if (test[i]<32u) test[i]=32u;
    

    And you should get the expected result.

    Test this here: https://onlinegdb.com/BJ8tj0kbd

    Note: whether char is signed is implementation-defined. You can check it on your platform with the following code, which prints 1 if char is signed:

    #include <limits>
    ...
    std::cout << std::numeric_limits<char>::is_signed << std::endl;
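
An alternative that does not depend on integer conversions is to cast each byte to unsigned char before comparing, so that E9 is always seen as 233 whether or not plain char is signed. The following is only a minimal sketch of that idea; the function name replace_control_chars is made up for illustration and is not part of the original answer.

```cpp
#include <string>

// Hypothetical helper (name chosen for this example): returns a copy of
// `text` in which every byte below 0x20 is replaced by a space.
std::string replace_control_chars(std::string text)
{
    for (char& c : text)
    {
        // Cast to unsigned char first: bytes 0x80..0xFF then compare as
        // 128..255 instead of negative values when char is signed.
        if (static_cast<unsigned char>(c) < 32)
        {
            c = ' ';
        }
    }
    return text;
}
```

Called as replace_control_chars(test), this leaves the é (0xE9) untouched and turns bytes such as 0x00 or 0x01 into spaces. If the input contains embedded nulls, build the std::string with an explicit length, for example std::string(buffer, size), so the data is not cut off at the first null.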