Convert a string to another encoding

I read usernames from a physical device. All usernames are Base64-encoded strings containing Cyrillic characters. The device somehow converts the characters to UTF-8 incorrectly, so I am trying to fix this on the client side in order to show the correct usernames in the GUI.

The problem is that some usernames convert successfully, but others raise an exception when I try to convert them.

Here is a simplified example of how I do it:

procedure TForm3.Button1Click(Sender: TObject);

  function Convert(ABase64String : string) : string;
  begin
    var xWin1251 :=  TEncoding.GetEncoding(1251);
    try
      var DecodedStr : string := TNetEncoding.Base64String.Decode(ABase64String);
      var DecodedBytes := TEncoding.UTF8.GetBytes(DecodedStr);
      var ConvertedBytes := TEncoding.Convert(TEncoding.UTF8, xWin1251, DecodedBytes);

      Result  :=  TEncoding.UTF8.GetString(ConvertedBytes);
    finally
      xWin1251.Free;
    end;
  end;

const
  s1 = '0KDRmtCg0ZHQodCD0KHQi9Cg0ZQg0KDRmtCg0ZHQoNGU0KDRldCgwrvQoMKwIA==';
  s2 = '0KDigJjQodGT0KDCttCg0ZHQoNCF0KHQitCh0IPQoNGU0KHigJMg0KDRmtCh4oCT0KE=';
begin
  ShowMessage(Convert(s1));
  ShowMessage(Convert(s2));
end;

s1 converts fine.

Debug values:

DecodedStr  'РњРёСЃСЋРє РњРёРєРѕР»Р° '
DecodedBytes    (208, 160, 209, 154, 208, 160, 209, 145, 208, 161, 208, 131, 208, 161, 208, 139, 208, 160, 209, 148, 32, 208, 160, 209, 154, 208, 160, 209, 145, 208, 160, 209, 148, 208, 160, 209, 149, 208, 160, 194, 187, 208, 160, 194, 176, 32)
ConvertedBytes  (208, 156, 208, 184, 209, 129, 209, 142, 208, 186, 32, 208, 156, 208, 184, 208, 186, 208, 190, 208, 187, 208, 176, 32)
Result  'Мисюк Микола '

s2 raises an exception with the message "No mapping for the Unicode character exists in the target multi-byte code page" instead of returning the value 'Бужиньскі МіС'.

Debug values:

DecodedStr  'Р‘СѓР¶РёРЅСЊСЃРєС– РњС–С'
DecodedBytes    (208, 160, 226, 128, 152, 208, 161, 209, 147, 208, 160, 194, 182, 208, 160, 209, 145, 208, 160, 208, 133, 208, 161, 208, 138, 208, 161, 208, 131, 208, 160, 209, 148, 208, 161, 226, 128, 147, 32, 208, 160, 209, 154, 208, 161, 226, 128, 147, 208, 161)
ConvertedBytes  (208, 145, 209, 131, 208, 182, 208, 184, 208, 189, 209, 140, 209, 129, 208, 186, 209, 150, 32, 208, 156, 209, 150, 209)

I found an old program that handles text encodings. It converts all values correctly with an operation called UTF8->WIN, so I know for certain that it is possible.

What did I miss?

Solution

I played around with your data, and your decoding code is correct.

The base64 you are given decodes to UTF-8 bytes. Those bytes must be decoded to UTF-16, re-encoded to Windows-1251, and the resulting bytes must then be interpreted as UTF-8 rather than as Windows-1251.

Your Convert() function does exactly this, although you don't need the 1st TEncoding.UTF8.GetBytes() call: the base64 already decodes to UTF-8 bytes (Base64String.Decode() assumes UTF-8 when returning a UTF-16 string), so you can decode straight to bytes and omit that round trip completely, e.g.:

function Convert(ABase64String : string) : string;
begin
  var xWin1251 :=  TEncoding.GetEncoding(1251);
  try
    //var DecodedStr : string := TNetEncoding.Base64String.Decode(ABase64String);
    //var DecodedBytes := TEncoding.UTF8.GetBytes(DecodedStr);
    var DecodedBytes := TNetEncoding.Base64.DecodeStringToBytes(ABase64String);

    var ConvertedBytes := TEncoding.Convert(TEncoding.UTF8, xWin1251, DecodedBytes);
    Result := TEncoding.UTF8.GetString(ConvertedBytes);
  finally
    xWin1251.Free;
  end;
end;

Now, with that said, this approach works perfectly for your 1st example Unicode string 'Мисюк Микола ', because the base64 for that string is correct.

For your 2nd example Unicode string 'Бужиньскі МіС', this approach works almost perfectly. The base64 data decodes properly as above up to the characters 'Бужиньскі Мі'. However, the data for the last character 'С', which is normally encoded in UTF-8 as the bytes $D0 $A1, converts to a single byte $D1 instead of the 2 bytes $D0 $A1. A lone $D1 is the lead byte of a 2-byte UTF-8 sequence with nothing after it, which is why you get the "No mapping" error from the final TEncoding.UTF8.GetString().
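To see the dangling byte for yourself, you can dump the converted bytes as hex before the final UTF-8 decode. The following is only a minimal console sketch: the program name and the helper ConvertedBytesAsHex are made up for illustration, and it assumes a Delphi version that provides System.NetEncoding.

```delphi
program DumpConverted;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.NetEncoding;

// Hypothetical helper (not part of the original code): performs the same
// base64 -> UTF-8 -> Windows-1251 steps as Convert(), but returns the
// resulting bytes as hex text instead of decoding them as UTF-8.
function ConvertedBytesAsHex(const ABase64String: string): string;
var
  Win1251: TEncoding;
  Converted: TBytes;
  B: Byte;
begin
  Result := '';
  Win1251 := TEncoding.GetEncoding(1251);
  try
    Converted := TEncoding.Convert(TEncoding.UTF8, Win1251,
      TNetEncoding.Base64.DecodeStringToBytes(ABase64String));
    // Append each byte as two hex digits followed by a space
    for B in Converted do
      Result := Result + IntToHex(B, 2) + ' ';
  finally
    Win1251.Free;
  end;
end;

begin
  // s2 from the question
  Writeln(ConvertedBytesAsHex(
    '0KDigJjQodGT0KDCttCg0ZHQoNCF0KHQitCh0IPQoNGU0KHigJMg0KDRmtCh4oCT0KE='));
end.
```

The output should correspond to the ConvertedBytes listing in the question, whose last value is 209 ($D1) with no continuation byte after it.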

So, it is not a mistake in your decoding code - you are simply decoding faulty input to begin with!

The correct base64 for the Unicode string 'Бужиньскі МіС' under the above scheme SHOULD BE:

'0KDigJjQodGT0KDCttCg0ZHQoNCF0KHQitCh0IPQoNGU0KHigJMg0KDRmtCh4oCT0KDQjg=='

NOT:

'0KDigJjQodGT0KDCttCg0ZHQoNCF0KHQitCh0IPQoNGU0KHigJMg0KDRmtCh4oCT0KE='

(i.e., DQjg== instead of E= at the end)

So, the mistake lies with whoever is sending you the base64 in the first place. The correct solution is to fix the source the base64 comes from, not your decoder (because it is not broken).
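If you want to check an expected value like this yourself, you can run the scheme in reverse: take the UTF-8 bytes of the real name, treat them as Windows-1251 text, re-encode that text as UTF-8, and base64 the result. This is a sketch under assumptions, not the device's actual code: the program name and the function Mangle are hypothetical, and the source file must be saved as UTF-8 with a BOM so the Cyrillic literal compiles correctly.

```delphi
program ExpectedBase64;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.NetEncoding;

// Hypothetical helper: reproduces the faulty encoding for a given name,
// i.e. the inverse of the Convert() function above.
function Mangle(const AName: string): string;
var
  Win1251: TEncoding;
  Mangled: TBytes;
begin
  Win1251 := TEncoding.GetEncoding(1251);
  try
    // Read the name's UTF-8 bytes as if they were Windows-1251 text,
    // then write that text out as UTF-8 again.
    Mangled := TEncoding.Convert(Win1251, TEncoding.UTF8,
      TEncoding.UTF8.GetBytes(AName));
    // Base64String (already used in the question) encodes without line breaks
    Result := TNetEncoding.Base64String.EncodeBytesToString(Mangled);
  finally
    Win1251.Free;
  end;
end;

begin
  Writeln(Mangle('Бужиньскі МіС'));
end.
```

For the final 'С' ($D0 $A1), the Windows-1251 reading is the two characters 'Р' and 'Ў', whose UTF-8 bytes are $D0 $A0 $D0 $8E, and those four bytes encode to 0KDQjg== in base64. The printed string should therefore match the corrected value given above.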