c# unicode encoding utf-8

How to Convert â„¢ Back Into ™ With C#


I have data coming in from an outside source. The datasource is converting a special character into â„¢.

According to this chart, there's an encoding issue:

https://www.i18nqa.com/debug/utf8-debug.html

So, how do I get C# to convert â„¢ back into ™?

I've tried the following, but I can't get back the character:

byte[] bytes1 = Encoding.Unicode.GetBytes("â„¢");
String str1 = Encoding.Unicode.GetString(bytes1);
String str2 = Encoding.UTF8.GetString(bytes1);
string str3 = Encoding.UTF32.GetString(bytes1);

var bytes2 = Encoding.Default.GetBytes("â„¢");
var str4 = Encoding.UTF8.GetString(bytes2);
var str5 = Encoding.UTF32.GetString(bytes2);
var str6 = Encoding.Unicode.GetString(bytes2);

byte[] bytes3 = Encoding.UTF8.GetBytes("â„¢");
String str7 = Encoding.Unicode.GetString(bytes3);
String str8 = Encoding.UTF8.GetString(bytes3);
string str9 = Encoding.UTF32.GetString(bytes3);

byte[] bytes4 = Encoding.UTF32.GetBytes("â„¢");
String str10 = Encoding.Unicode.GetString(bytes4);
String str11 = Encoding.UTF8.GetString(bytes4);
string str12 = Encoding.UTF32.GetString(bytes4);

Solution

  • â„¢ is what you get when the UTF-8 encoded form of ™ (bytes E2 84 A2) is misinterpreted in a single-byte Latin encoding like Windows-1252, where those three bytes decode as â, „ and ¢. To reverse it, try using Encoding.Default to recover the bytes, and then Encoding.UTF8 to decode them, e.g.:

    byte[] bytes = Encoding.Default.GetBytes("â„¢");
    String str = Encoding.UTF8.GetString(bytes);
    

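    If Encoding.Default doesn't work (e.g., because your OS locale uses a different ANSI code page, or because you are running on .NET Core / .NET 5+, where Encoding.Default is always UTF-8), then request Windows-1252 explicitly with Encoding.GetEncoding() instead. On .NET Core / .NET 5+, code page 1252 is only available after Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) has been called: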

    byte[] bytes = Encoding.GetEncoding(1252).GetBytes("â„¢");
    String str = Encoding.UTF8.GetString(bytes);
    

    Of course, the real solution is to fix the interpretation of your input so you don't end up with â„¢ in the first place. The input data is UTF-8, so it needs to be interpreted as UTF-8 when brought into your program, not as Windows-1252 or similar.
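
To tie the two approaches together, here is a minimal sketch, assuming .NET Core 3.0 or later (or .NET 5+), where the code-pages provider has to be registered before Windows-1252 can be looked up. The class name MojibakeRepair and both method names are hypothetical, chosen only for illustration. The first method reverses the damage after the fact; the second shows the preferred route of decoding the raw input bytes as UTF-8 in the first place.

```csharp
using System;
using System.Text;

// Hypothetical helper; the names are illustrative and not part of any library.
public static class MojibakeRepair
{
    // Reverses "UTF-8 bytes that were decoded as Windows-1252".
    public static string FixUtf8ReadAs1252(string garbled)
    {
        // Needed on .NET Core / .NET 5+ so that code page 1252 can be resolved.
        // Registering the same provider more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding(1252);

        // Step 1: turn the garbled characters back into the original bytes.
        byte[] originalBytes = windows1252.GetBytes(garbled);

        // Step 2: decode those bytes the way they were meant to be read.
        return Encoding.UTF8.GetString(originalBytes);
    }

    // The preferred fix: decode the incoming bytes as UTF-8 from the start,
    // so the garbled form never appears.
    public static string DecodeUtf8(byte[] rawBytes)
    {
        return Encoding.UTF8.GetString(rawBytes);
    }
}
```

With this sketch, MojibakeRepair.FixUtf8ReadAs1252("â„¢") should return "™", and MojibakeRepair.DecodeUtf8(new byte[] { 0xE2, 0x84, 0xA2 }) should return "™" directly. On .NET Framework the RegisterProvider call is likely unnecessary, since Windows-1252 is available there by default.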