I have data coming in from an outside source. The data source is converting a ™ special character into â„¢.
According to this chart, there's an encoding issue:
https://www.i18nqa.com/debug/utf8-debug.html
So, how do I get C# to convert â„¢ back into ™?
I've tried the following, but I can't get back the ™ character:
byte[] bytes1 = Encoding.Unicode.GetBytes("â„¢");
String str1 = Encoding.Unicode.GetString(bytes1);
String str2 = Encoding.UTF8.GetString(bytes1);
string str3 = Encoding.UTF32.GetString(bytes1);
var bytes2 = Encoding.Default.GetBytes("â„¢");
var str4 = Encoding.UTF8.GetString(bytes2);
var str5 = Encoding.UTF32.GetString(bytes2);
var str6 = Encoding.Unicode.GetString(bytes2);
byte[] bytes3 = Encoding.UTF8.GetBytes("â„¢");
String str7 = Encoding.Unicode.GetString(bytes3);
String str8 = Encoding.UTF8.GetString(bytes3);
string str9 = Encoding.UTF32.GetString(bytes3);
byte[] bytes4 = Encoding.UTF32.GetBytes("â„¢");
String str10 = Encoding.Unicode.GetString(bytes4);
String str11 = Encoding.UTF8.GetString(bytes4);
string str12 = Encoding.UTF32.GetString(bytes4);
â„¢ is what you get when the UTF-8 encoded form of ™ (bytes E2 84 A2) is misinterpreted in a Latin encoding like Windows-1252, where E2 is â, 84 is „, and A2 is ¢. To reverse it, try using Encoding.Default to recover the bytes, and then Encoding.UTF8 to decode them, e.g.:
byte[] bytes = Encoding.Default.GetBytes("â„¢");
String str = Encoding.UTF8.GetString(bytes);
If Encoding.Default doesn't work (i.e., because your OS locale is something else, or because you are on .NET Core / .NET 5+, where Encoding.Default is always UTF-8), then use Encoding.GetEncoding() instead:
byte[] bytes = Encoding.GetEncoding(1252).GetBytes("â„¢");
String str = Encoding.UTF8.GetString(bytes);
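As a minimal sketch of the same round trip wrapped in a reusable method: the class name MojibakeFixer and the method Fix are hypothetical, and the string literals use \u escapes so the example does not depend on how the source file itself is saved. It assumes the runtime is .NET Core 3.0 or later (or that the System.Text.Encoding.CodePages package is referenced), where code page 1252 is only available once CodePagesEncodingProvider has been registered.

```csharp
using System;
using System.Text;

public static class MojibakeFixer
{
    // Hypothetical helper: reverses "UTF-8 bytes that were decoded as Windows-1252".
    public static string Fix(string garbled)
    {
        // Assumption: running on .NET Core / .NET 5+, where code page 1252
        // must be made available by registering the code pages provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Step 1: re-encode with the encoding that was wrongly used to decode,
        // which recovers the original UTF-8 bytes (E2 84 A2 for the trademark sign).
        Encoding windows1252 = Encoding.GetEncoding(1252);
        byte[] originalBytes = windows1252.GetBytes(garbled);

        // Step 2: decode those bytes with the encoding they were actually written in.
        return Encoding.UTF8.GetString(originalBytes);
    }

    public static void Main()
    {
        // U+00E2, U+201E, U+00A2 is the garbled three-character sequence.
        string garbled = "\u00E2\u201E\u00A2";
        string repaired = Fix(garbled);

        // Compares the result against U+2122 (the trademark sign).
        Console.WriteLine(repaired == "\u2122");
    }
}
```

This only repairs text that went through exactly one wrong decode with Windows-1252. If the data source used a different code page, the number passed to Encoding.GetEncoding() would need to change accordingly.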
Of course, the real solution is to fix the interpretation of your input so you don't end up with â„¢ in the first place. The input data is UTF-8, so it needs to be interpreted as UTF-8 when brought into your program, not as Windows-1252 or similar.
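How to do that depends on how the data actually arrives, which the question doesn't say. As one hedged illustration, assuming the input is available as a Stream, passing Encoding.UTF8 explicitly to StreamReader avoids relying on any default. The class Utf8Input and the method ReadAllAsUtf8 are hypothetical names, and the MemoryStream merely stands in for the real data source.

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf8Input
{
    // Hypothetical helper: reads an entire stream as UTF-8 text,
    // naming the encoding explicitly rather than relying on a default.
    public static string ReadAllAsUtf8(Stream input)
    {
        using (var reader = new StreamReader(input, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Stand-in for the outside source: the UTF-8 bytes of the trademark sign.
        byte[] raw = { 0xE2, 0x84, 0xA2 };

        using (var stream = new MemoryStream(raw))
        {
            string text = ReadAllAsUtf8(stream);

            // Compares the decoded text against U+2122 (the trademark sign).
            Console.WriteLine(text == "\u2122");
        }
    }
}
```

The same idea applies to other entry points: wherever bytes become a string, name UTF-8 explicitly instead of accepting whatever the API or locale defaults to.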