Below is a simplified example of my problem. I have some external byte data which appears to be a string containing the cp1252-encoded degree symbol 0xb0. When it is stored in my program as a std::string, it is correctly represented as 0xffffffb0. However, when that string is then written to a file, the resulting file is only one byte long, with just 0xb0. How do I write the string to the file? How does the concept of UTF-8 come into this?
#include <iostream>
#include <fstream>
#include <cstdio>

typedef struct
{
    char n[40];
} mystruct;

static void dump(const std::string& name)
{
    std::cout << "It is '" << name << "'" << std::endl;
    const char *p = name.data();
    for (size_t i = 0; i < name.size(); i++)
    {
        printf("0x%02x ", p[i]);
    }
    std::cout << std::endl;
}

int main()
{
    const unsigned char raw_bytes[] = { 0xb0, 0x00 };

    mystruct foo;
    foo = *(mystruct *)raw_bytes;

    std::string name = std::string(foo.n);
    dump(name);

    std::ofstream my_out("/tmp/out.bin", std::ios::out | std::ios::binary);
    my_out << name;
    my_out.close();

    return 0;
}
Running the above program produces the following on STDOUT:
It is '�'
0xffffffb0
First of all, this is a must read:

Now, when you are done with that, you have to understand what type p[i] is.
It is char, which in C and C++ is a small integer type. Whether plain char is signed is implementation-defined, but on the common platforms it is signed, so a char can be negative!
Now, since you have cp1252 characters, they are outside the ASCII range (above 0x7f), which means they are stored as negative char values! When such a char is passed to printf(), it is promoted to int and the sign bit is replicated (sign extension), so you see 0xffffff<actual byte value> in the output.
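A minimal, self-contained illustration of that sign extension (a sketch, assuming a platform where plain char is signed):

#include <cstdio>

int main()
{
    // 0xb0 is the cp1252 degree sign; on a signed-char platform it is stored as -80.
    char c = static_cast<char>(0xb0);

    // The variadic call promotes c to int, replicating the sign bit,
    // so -80 is printed as the 32-bit pattern ffffffb0.
    printf("0x%x\n", c);
    return 0;
}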
To handle that in C, first you should cast to unsigned char:

printf("0x%02x ", (unsigned char)p[i]);

Then the promotion to int fills in the upper bits with zeros and printf() will give you the proper value.
Now, in C++ this is a bit more nasty, since char and unsigned char are treated by the stream operators as character representations. So to print them in hex, you need something like this:
int charToInt(char ch)
{
    return static_cast<int>(static_cast<unsigned char>(ch));
}

std::cout << std::hex << charToInt(p[i]);
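Putting it together, here is a sketch of the dump() function from the question rewritten with streams. The setw/setfill calls are only there to reproduce the 0x%02x padding; the function name and structure are taken from the question, the rest is my own illustration:

#include <iomanip>
#include <iostream>
#include <string>

static int charToInt(char ch)
{
    // Go through unsigned char so the value stays in the 0..255 range.
    return static_cast<int>(static_cast<unsigned char>(ch));
}

static void dump(const std::string& name)
{
    std::cout << "It is '" << name << "'" << std::endl;
    for (std::size_t i = 0; i < name.size(); i++)
    {
        std::cout << "0x" << std::hex << std::setw(2) << std::setfill('0')
                  << charToInt(name[i]) << ' ';
    }
    std::cout << std::dec << std::endl;
}

int main()
{
    std::string name(1, static_cast<char>(0xb0));  // one cp1252 degree-sign byte
    dump(name);                                    // byte dump line prints: 0xb0
}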
Note that a direct conversion from char to unsigned int will not fix the problem: the negative value is, in effect, sign-extended before it becomes unsigned, so you still end up with 0xffffffb0. You have to go through unsigned char first.
See here: https://wandbox.org/permlink/sRmh8hZd78Oar7nF
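For reference, a small sketch of the same comparison (I'm assuming the linked wandbox example demonstrates something similar):

#include <iostream>

int main()
{
    char ch = static_cast<char>(0xb0);  // negative on a signed-char platform

    // Direct cast: the negative value wraps around, prints ffffffb0.
    std::cout << std::hex << static_cast<unsigned int>(ch) << '\n';

    // Going through unsigned char first keeps the value in 0..255, prints b0.
    std::cout << std::hex
              << static_cast<unsigned int>(static_cast<unsigned char>(ch)) << '\n';
}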
UTF-8 has nothing to do with this issue. The file you wrote is actually correct: it contains the single cp1252 byte 0xb0 that the string holds; only the hex printout of the string was misleading.
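If you want to convince yourself of that, a quick sketch that reads /tmp/out.bin back and dumps it with the unsigned-char cast applied (it should print exactly 0xb0, the one byte of cp1252 data the string held):

#include <cstdio>
#include <fstream>

int main()
{
    // Read back the file written by the program in the question.
    std::ifstream in("/tmp/out.bin", std::ios::in | std::ios::binary);
    char ch;
    while (in.get(ch))
    {
        // The unsigned-char cast avoids the sign-extension artifact.
        printf("0x%02x ", static_cast<unsigned char>(ch));
    }
    printf("\n");  // expected output: 0xb0
}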
Off-topic: please, when you write pure C++ code, do not use C constructs. It is pointless, it makes the code harder to maintain, and it is not faster. So:

- Do not use char* or char[] to store strings. Just use std::string.
- Do not use printf(); use std::cout (or the fmt library, if you like format strings - it became the basis of std::format in C++20).
- Do not use alloc(), malloc(), free() - in modern C++, use std::make_unique() and std::make_shared().