Consider the following code:
#include <string>
#include <fstream>
#include <iomanip>

int main() {
    std::string s = "\xe2\x82\xac\u20ac";
    std::ofstream out("test.txt");
    out << s.length() << ":" << s << std::endl;
    out << std::endl;
    out.close();
}
Under GCC 4.8 on Linux (Ubuntu 14.04), the file test.txt contains this:
6:€€
Under Visual C++ 2013 on Windows, it contains this:
4:€\x80
(By '\x80' I mean the single 8-bit character 0x80).
I've been completely unable to get either compiler to output a € character using std::wstring.
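Two questions:
1. What is the compiler doing with the char* literal? It's obviously doing something to encode it, but what is not clear.
2. How can I rewrite this using std::wstring and std::wofstream so that it outputs two € characters?

The outputs differ because you are using \u20ac, which is a universal character name (a Unicode escape), inside a narrow string literal, and each compiler encodes it in its own narrow character set.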
MSVC encodes "\xe2\x82\xac\u20ac" as 0xe2, 0x82, 0xac, 0x80, which is 4 narrow characters. It essentially encodes \u20ac as 0x80 because it maps the euro character to the Windows-1252 code page, where the euro sign is the single byte 0x80.
GCC converts the Unicode escape \u20ac to the 3-byte UTF-8 sequence 0xe2, 0x82, 0xac, so the resulting string ends up as 0xe2, 0x82, 0xac, 0xe2, 0x82, 0xac, which is 6 narrow characters.
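To make the 3-byte sequence concrete, here is a minimal sketch of the UTF-8 encoding rule applied to a single code point. The helper names encode_utf8 and hex_bytes are made up for illustration, and the encoder assumes a valid code point (no surrogate or range checking):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: render each byte of s as two lowercase hex digits,
// separated by single spaces.
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned byte = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof buf, "%02x", byte);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling hex_bytes(encode_utf8(0x20AC)) gives "e2 82 ac", the same three bytes as the hand-written "\xe2\x82\xac" in the question. That is why GCC reports a length of 6 for the combined literal.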
If you use std::wstring s = L"\xe2\x82\xac\u20ac", MSVC encodes it as the bytes 0xe2, 0x00, 0x82, 0x00, 0xac, 0x00, 0xac, 0x20, which is 4 wide characters (the 16-bit code units 0x00e2, 0x0082, 0x00ac, and 0x20ac in little-endian byte order). Since this mixes hand-written UTF-8 bytes with a UTF-16 code unit, the resulting string doesn't make much sense. If you use std::wstring s = L"\u20ac\u20ac", you get 2 Unicode characters in a wide string, as you'd expect.
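If you want to check this yourself, a rough sketch like the following prints the code units of a wide string. The helper name hex_units is invented for this example, and it prints values rather than raw memory bytes, so the result does not depend on byte order or on whether wchar_t is 16 or 32 bits:

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: render each wchar_t code unit of s as at least
// four lowercase hex digits, separated by single spaces.
std::string hex_units(const std::wstring& s) {
    std::string out;
    char buf[16];
    for (std::size_t i = 0; i < s.size(); ++i) {
        std::snprintf(buf, sizeof buf, "%04lx", static_cast<unsigned long>(s[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

With this, hex_units(L"\xe2\x82\xac\u20ac") gives "00e2 0082 00ac 20ac" (four unrelated wide characters), while hex_units(L"\u20ac\u20ac") gives "20ac 20ac" (two euro signs).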
The next problem is that MSVC's ofstream and wofstream write narrow ANSI/ASCII output by default. To get it to write UTF-8, you should use <codecvt> (VS 2010 or later):
#include <string>
#include <fstream>
#include <iomanip>
#include <locale>
#include <codecvt>

int main()
{
    std::wstring s = L"\u20ac\u20ac";
    std::wofstream out("test.txt");
    // Convert wide characters to UTF-8 on output; the locale takes ownership of the facet.
    std::locale loc(std::locale::classic(), new std::codecvt_utf8<wchar_t>);
    out.imbue(loc);
    out << s.length() << L":" << s << std::endl;
    out << std::endl;
    out.close();
}
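As a side note, the same facet can be used to do the conversion in memory, which is handy for checking what the stream will write. This is only a sketch: the helper name to_utf8 is made up, and std::wstring_convert and std::codecvt_utf8 are deprecated since C++17, so a newer compiler may warn about them:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: convert a wide string to UTF-8 bytes in memory,
// using the same codecvt_utf8<wchar_t> facet as the stream example.
// Note: deprecated since C++17; shown here only for illustration.
std::string to_utf8(const std::wstring& ws) {
    std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
    return conv.to_bytes(ws);
}
```

For example, to_utf8(L"\u20ac\u20ac") returns the six bytes 0xe2, 0x82, 0xac, 0xe2, 0x82, 0xac, which is what the file should contain for the two euro signs.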
To write UTF-16 (or more specifically UTF-16LE) instead:
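#include <string>
#include <fstream>
#include <iomanip>
#include <locale>
#include <codecvt>

int main()
{
    std::wstring s = L"\u20ac\u20ac";
    // Binary mode: the stream must not translate line endings in the UTF-16 byte stream.
    std::wofstream out("test.txt", std::ios::binary);
    // Convert wide characters to UTF-16LE on output; the locale takes ownership of the facet.
    std::locale loc(std::locale::classic(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>);
    out.imbue(loc);
    // Write the Windows line ending explicitly instead of using std::endl.
    out << s.length() << L":" << s << L"\r\n";
    out << L"\r\n";
    out.close();
}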
Note: With UTF-16 you have to open the file in binary mode rather than text mode to avoid corruption. In binary mode std::endl would write only a bare line feed, so we write L"\r\n" explicitly to get the correct end-of-line behavior for a Windows text file.
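To see why text mode is a problem, it helps to look at the actual bytes. The sketch below serializes UTF-16 code units as little-endian bytes by hand. The helper names to_utf16le and hex_bytes are invented for illustration, and it takes a std::u16string so the result is the same regardless of the platform's wchar_t size:

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: serialize UTF-16 code units as little-endian bytes.
std::string to_utf16le(const std::u16string& s) {
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        out += static_cast<char>(s[i] & 0xFF);         // low byte first
        out += static_cast<char>((s[i] >> 8) & 0xFF);  // then high byte
    }
    return out;
}

// Hypothetical helper: render each byte of s as two lowercase hex digits,
// separated by single spaces.
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned byte = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof buf, "%02x", byte);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Here hex_bytes(to_utf16le(u"\u20ac\r\n")) gives "ac 20 0d 00 0a 00". In text mode on Windows, the lone 0x0a byte would be expanded to 0x0d 0x0a, turning "0a 00" into "0d 0a 00" and shifting every following 2-byte unit by one byte.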