I'm getting some information from a C++ backend via the Thrift protocol, including a string (a name) with German umlauts. These umlauts are displayed as question marks, so I think I'm on the right path in trying to convert them to UTF-8, although Thrift seems to pass strings as UTF-8 anyway.
The original data comes from a PostgreSQL database and is displayed correctly in the C++ code just before it is sent to the Thrift interface.
I have already tried three different ways to convert the string, but none of them really does anything, and I'm stuck here.
Version 1:
private string ConvertUTF8(string str) // str == "Ha�loch, �mely"
{
    byte[] bytSrc;
    byte[] bytDestination;
    string strTo = string.Empty;
    bytSrc = Encoding.Unicode.GetBytes(str);
    bytDestination = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, bytSrc);
    strTo = Encoding.UTF8.GetString(bytDestination);
    return strTo; // strTo == "Ha�loch, �mely"
}
Version 2:
private string ConvertUTF8(string str) // str == "Ha�loch, �mely"
{
    byte[] bytes = str.Select(c => (byte)c).ToArray();
    return Encoding.UTF8.GetString(bytes); // == "Ha�loch, �mely"
}
Version 3:
private string ConvertUTF8(string str) // str == "Ha�loch, �mely"
{
    byte[] bytes = Encoding.Default.GetBytes(str);
    return Encoding.UTF8.GetString(bytes); // == "Ha?loch, ?mely"
}
As you can see, version 3 - for whatever reason - changes the � to a regular ? but the result should be "Haßloch, Ämely". Any idea what I'm doing wrong?
edit 1:
On the C++ side, the string is produced by QString.toStdString() and then passed to Thrift. According to the Qt documentation, toStdString() already converts to UTF-8 (also see the top answer here). So the string should be passed correctly, and the Thrift interface also seems to use UTF-8 internally.
edit 2:
I tried to figure out where the string first occurs and found this line:
Name = iprot.ReadString();
where Name is of type string and iprot is of type Thrift.Protocol.TCompactProtocol.
For the ReadString() method, the Thrift doc says "Reads a byte[] (via readBinary), and then UTF-8 decodes it", so this also can't be the reason ...
edit 3 (SOLUTION):
Marc Gravell pointed me in the right direction. I just replaced
Name = iprot.ReadString();
with
var bytes = iprot.ReadBinary();
Name = Encoding.GetEncoding("Windows-1252").GetString(bytes);
edit 4:
even simpler:
var bytes = iprot.ReadBinary();
Name = Encoding.Default.GetString(bytes);
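A caveat on the shortcut in edit 4: Encoding.Default is the system's ANSI code page on .NET Framework, so the result depends on the machine's locale, and on .NET Core / .NET 5+ it is always UTF-8, which would bring the original problem back. Naming the encoding explicitly avoids both. The following is only a sketch: NameDecoder and DecodeName are hypothetical names, and ISO-8859-1 is used as a stand-in because it is built into every runtime. It agrees with Windows-1252 for ß and Ä, but not for bytes 0x80 to 0x9F; for real Windows-1252 on .NET Core you would first have to register CodePagesEncodingProvider.Instance.

```csharp
using System.Text;

public static class NameDecoder
{
    // Explicit encoding: does not change with the machine's locale or the runtime.
    // Assumption: the sender uses a Latin-1 compatible single-byte encoding.
    private static readonly Encoding SenderEncoding = Encoding.GetEncoding("iso-8859-1");

    // Hypothetical helper: 'raw' stands for the bytes returned by iprot.ReadBinary().
    public static string DecodeName(byte[] raw)
    {
        return SenderEncoding.GetString(raw);
    }
}
```

With this in place, the read would become Name = NameDecoder.DecodeName(iprot.ReadBinary());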
If you get as far as having a string str input, you've already lost the data. string (System.String) in .NET is always UTF-16. You need to look upstream, at wherever the input data came from (presumably reading from some file, byte-buffer, http-client, or database). It is usually simply a case of specifying the correct Encoding at the point where you originally decode the data.
You cannot fix encoding after the fact; in the code above, you've already irretrievably lost what you wanted.
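To make this concrete, here is a minimal sketch of what happens when bytes in a single-byte encoding are decoded as UTF-8. It is not the Thrift code from the question: it assumes the backend actually sent "Haßloch" as ISO-8859-1 / Windows-1252 bytes (which the fix in the question suggests), and LossyDecodeDemo and its members are made-up names for illustration.

```csharp
using System;
using System.Text;

public static class LossyDecodeDemo
{
    // "Haßloch" as a Latin-1 / Windows-1252 sender would put it on the wire (assumed):
    // 'ß' is the single byte 0xDF, which is not valid UTF-8 in this position.
    public static readonly byte[] Wire = { 0x48, 0x61, 0xDF, 0x6C, 0x6F, 0x63, 0x68 };

    // Wrong: UTF-8 decoding substitutes U+FFFD for the byte it cannot decode.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Right (under the assumption above): decode with the sender's encoding.
    public static string DecodeAsLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }

    // Re-encoding a damaged string yields the bytes of U+FFFD (EF BF BD),
    // not the original 0xDF, so the original character cannot be recovered.
    public static string ReencodedHex(string damaged)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(damaged));
    }
}
```

Calling LossyDecodeDemo.DecodeAsUtf8(LossyDecodeDemo.Wire) gives a string with U+FFFD where the ß should be, and no later conversion of that string can tell which byte was there originally. DecodeAsLatin1 on the same bytes returns "Haßloch", which is why the fix has to happen where the bytes are first decoded.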