I have a weird input file with all kinds of control characters like nulls. I want to remove all control characters from this Windows-1252 encoded text file, but if you do this:
std::string test="tést";
for (int i=0;i<test.length();i++)
{
if (test[i]<32) test[i]=32; // change all control characters into spaces
}
It will change the é into a space as well.
So if you have a string like this, encoded in Windows-1252:
std::string test="tést";
The hex values would be:
t é s t
74 E9 73 74
See https://en.wikipedia.org/wiki/ASCII and https://en.wikipedia.org/wiki/Windows-1252
test[0] would equal to decimal 116 (=0x74), but apparently with é/0xE9, test[1] does not equal the decimal value 233.
So how can you recognize that é properly?
32
is a signed integer, comparing the char
with the signed integer is performed by the compiler as signed: E9 (-23)<32 which return true.
Using an unsigned literal of 32
, that is 32u
makes the comparison to be performed on unsigned values: E9 (233) < 32 which return false.
Replace :
if (test[i]<32) test[i]=32;
By:
if (test[i]<32u) test[i]=32u;
And you should get the expected result.
Test this here: https://onlinegdb.com/BJ8tj0kbd
Note: you can check that char
is signed with the following code:
#include <limits>
...
std::cout << std::numeric_limits<char>::is_signed << std::endl;