Tags: c#, qt, utf-8, thrift, thrift-protocol

C# UTF-8 conversion problems with German umlauts


I'm receiving data from a C++ backend over Thrift, including a string (a name) that contains German umlauts. The umlauts are displayed as question marks, so I assumed I had to convert the string to UTF-8, although Thrift seems to pass strings as UTF-8 anyway.

The original data comes from a PostgreSQL database and is displayed correctly in the C++ code just before it is handed to the Thrift interface.

I have already tried three different conversions, but none of them has any real effect, and I'm stuck.

Version 1:

private string ConvertUTF8(string str) // str == "Ha�loch, �mely"
{
  byte[] bytSrc;
  byte[] bytDestination;
  string strTo = string.Empty;

  bytSrc = Encoding.Unicode.GetBytes(str);
  bytDestination = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, bytSrc);
  strTo = Encoding.UTF8.GetString(bytDestination);

  return strTo; // strTo == "Ha�loch, �mely"
}

Version 2:

private string ConvertUTF8(string str) // str == "Ha�loch, �mely"
{
  byte[] bytes = str.Select(c => (byte)c).ToArray();
  return Encoding.UTF8.GetString(bytes); // == "Ha�loch, �mely"
}

Version 3:

private string ConvertUTF8(string str) // str == "Ha�loch, �mely"
{
  byte[] bytes = Encoding.Default.GetBytes(str);
  return Encoding.UTF8.GetString(bytes); // == "Ha?loch, ?mely"
}

As you can see, version 3 (for whatever reason) changes the � to a regular ?, but the result should be "Haßloch, Ämely". Any idea what I'm doing wrong?
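To see why none of the three versions can work, it may help to look at the bytes. The following is only a sketch under an assumption: that the name arrives as single-byte Western bytes (ß = 0xDF, Ä = 0xC4) and is then decoded as UTF-8. The class and method names are made up for illustration. ISO-8859-1 is used because it is available without extra setup and agrees with Windows-1252 for these two characters.

```csharp
using System;
using System.Text;

public static class LossyDecodeDemo
{
    // Assumption: "Haßloch, Ämely" as single-byte Western bytes (ß = 0xDF, Ä = 0xC4).
    public static readonly byte[] Wire =
    {
        0x48, 0x61, 0xDF, 0x6C, 0x6F, 0x63, 0x68, 0x2C, 0x20,
        0xC4, 0x6D, 0x65, 0x6C, 0x79
    };

    // 0xDF and 0xC4 are not valid UTF-8 here, so the decoder
    // replaces each of them with U+FFFD (the "�" character).
    public static string DecodedAsUtf8()
    {
        return Encoding.UTF8.GetString(Wire);
    }

    // Decoding the same bytes with a matching single-byte encoding keeps the text.
    public static string DecodedAsLatin1()
    {
        return Encoding.GetEncoding("ISO-8859-1").GetString(Wire);
    }

    // Re-encoding the already decoded string only yields the bytes of U+FFFD
    // (EF-BF-BD); the original 0xDF and 0xC4 are gone.
    public static string ReencodedHex()
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(DecodedAsUtf8()));
    }
}
```

Both ß and Ä collapse into the same replacement character, so once the string looks like "Ha�loch, �mely" no conversion of that string can tell them apart again.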

edit 1:

On the C++ side, the string is produced by QString::toStdString() and then passed to Thrift. According to the Qt documentation, the toStdString() call already includes the conversion to UTF-8 (see also the top answer here). So the string should be passed correctly, and the Thrift interface also seems to use UTF-8 internally.

edit 2:

I tried to find out where the string first appears and found this line:

Name = iprot.ReadString();

where Name is of type string and iprot is of type Thrift.Protocol.TCompactProtocol.

For the ReadString() method, the Thrift documentation says "Reads a byte[] (via readBinary), and then UTF-8 decodes it", so this can't be the cause either ...

edit 3 (SOLUTION):

Marc Gravell pointed me in this direction. I just replaced

Name = iprot.ReadString();

with

var bytes = iprot.ReadBinary();
Name = Encoding.GetEncoding("Windows-1252").GetString(bytes);

edit 4:

Even simpler:

var bytes = iprot.ReadBinary();
Name = Encoding.Default.GetString(bytes);
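One caveat on the simpler variant: Encoding.Default depends on the platform. On .NET Framework it is the system's ANSI code page, while on .NET Core and later it is always UTF-8, so the line above would behave differently there. A slightly more defensive sketch is shown below. It assumes the sender produces either valid UTF-8 or single-byte Western text; the NameDecoder class is hypothetical, and ISO-8859-1 stands in for Windows-1252 because it needs no extra encoding provider and matches it for the umlauts and ß.

```csharp
using System;
using System.Text;

public static class NameDecoder
{
    // Strict UTF-8: throws on invalid bytes instead of silently emitting U+FFFD.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(false, true); // (emit BOM: false, throw on invalid bytes: true)

    // Assumed fallback for senders that deliver single-byte Western text.
    private static readonly Encoding Fallback = Encoding.GetEncoding("ISO-8859-1");

    public static string Decode(byte[] bytes)
    {
        try
        {
            return StrictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8, so treat the bytes as single-byte text.
            return Fallback.GetString(bytes);
        }
    }
}
```

With this helper the call site would read `Name = NameDecoder.Decode(iprot.ReadBinary());`. It is a workaround on the receiving side; if the C++ side can be changed to send real UTF-8, ReadString() should work unchanged.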

Solution

  • If you have got as far as having a string str as input, you have already lost the data. A string (System.String) in .NET is always UTF-16. You need to look upstream, at wherever the input data came from (presumably reading from a file, byte buffer, HTTP client, or database). It is usually just a matter of specifying the correct Encoding at the point where you originally decode the data.

    You cannot fix the encoding after the fact; in the code above, you have already irretrievably lost what you wanted.
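As a rough illustration of "specify the Encoding where you decode", here is a minimal sketch for the stream case. The DecodeAtSource helper is hypothetical; the point is only that the encoding is chosen at the moment bytes become a string, not afterwards.

```csharp
using System.IO;
using System.Text;

public static class DecodeAtSource
{
    // Turns the raw bytes of a stream into text using the encoding the
    // producer actually used. Choosing it is the caller's responsibility.
    public static string ReadAll(Stream source, Encoding encoding)
    {
        // BOM detection is switched off so the given encoding is always used.
        using (var reader = new StreamReader(source, encoding, false))
        {
            return reader.ReadToEnd();
        }
    }
}
```

For example, `DecodeAtSource.ReadAll(new MemoryStream(bytes), Encoding.GetEncoding("ISO-8859-1"))` decodes single-byte Western data correctly, whereas passing Encoding.UTF8 for the same bytes would produce replacement characters.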