My C# program gets some UTF-8 encoded data and decodes it using Encoding.UTF8.GetString(data). When the program that produces the data gets characters outside the BMP, it encodes them as 2 surrogate characters, each encoded as UTF-8 separately. In such cases, my program can't decode them properly.
How can I decode such data in C#?
Example:
static void Main(string[] args)
{
    string orig = "🌎";
    byte[] correctUTF8 = Encoding.UTF8.GetBytes(orig); // Simulate correct conversion using std::codecvt_utf8_utf16<wchar_t>
    Console.WriteLine("correctUTF8: " + BitConverter.ToString(correctUTF8)); // F0-9F-8C-8E - that's what the C++ program should've produced

    // Simulate bad conversion using std::codecvt_utf8<wchar_t> - that's what I get from the program
    byte[] badUTF8 = new byte[] { 0xED, 0xA0, 0xBC, 0xED, 0xBC, 0x8E };
    string badString = Encoding.UTF8.GetString(badUTF8); // ���� (4 * U+FFFD 'REPLACEMENT CHARACTER')

    // How can I convert this?
}
Note: The encoding program is written in C++, and converts the data using std::codecvt_utf8<wchar_t> (code below). As @PeterDuniho's answer correctly notes, it should've used std::codecvt_utf8_utf16<wchar_t>. Unfortunately, I don't control this program, and can't change its behavior - only handle its malformed input.
std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8Converter;
std::string utf8str = utf8Converter.to_bytes(wstr);
It's impossible to know for sure without a good Minimal, Complete, and Verifiable code example. But it looks to me as though you are using the wrong converter in C++.
The std::codecvt_utf8<wchar_t> facet converts from UCS-2, not UTF-16. The two are very similar, but UCS-2 doesn't support the surrogate pairs that would be required to encode the character you want to encode.
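To see concretely what that means for your globe character, here is a minimal C# sketch (not the C++ converter itself, just an illustration of its effective behavior): each UTF-16 code unit of the string, surrogate or not, gets encoded as its own 3-byte UTF-8 sequence, which reproduces the bad bytes from the question.
string orig = "🌎";                              // U+1F30E, stored as the surrogate pair D83C DF0E
byte[] badBytes = new byte[orig.Length * 3];
for (int i = 0; i < orig.Length; i++)            // iterates UTF-16 code units, not code points
{
    int cu = orig[i];                            // treat each code unit as if it were a code point
    badBytes[3 * i] = (byte)(0xE0 | (cu >> 12));
    badBytes[3 * i + 1] = (byte)(0x80 | ((cu >> 6) & 0x3F));
    badBytes[3 * i + 2] = (byte)(0x80 | (cu & 0x3F));
}
Console.WriteLine(BitConverter.ToString(badBytes)); // ED-A0-BC-ED-BC-8E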
Instead, you should be using std::codecvt_utf8_utf16<wchar_t>:
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> utf8Converter;
std::string utf8str = utf8Converter.to_bytes(wstr);
When I use that converter, I get the UTF-8 bytes needed: F0 9F 8C 8E. These, of course, decode correctly in .NET when interpreted as UTF-8.
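A quick check on the .NET side (assuming the usual using System.Text; directive):
byte[] utf8 = { 0xF0, 0x9F, 0x8C, 0x8E };
string decoded = Encoding.UTF8.GetString(utf8);
Console.WriteLine(decoded == "🌎");   // True
Console.WriteLine(decoded.Length);    // 2 -- one surrogate pair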
Addendum:
The question has been updated to indicate that the encoding code can't be changed. You are stuck with UTF-16 text that has been encoded as if it were UCS-2, producing invalid UTF8. Because the UTF8 is invalid, you'll have to decode the text yourself.
I see a couple of reasonable ways to do this. First, write a decoder that doesn't care if the UTF8 includes invalid byte sequences. Second, use the C++ std::wstring_convert<std::codecvt_utf8<wchar_t>> converter to decode the bytes for you (e.g. write your receiving code in C++, or write a C++ DLL you can call from your C# code to do the work).
The second option is in some sense the more reliable: you'd be using exactly the decoder that created the bad data in the first place. On the other hand, it might be overkill even to create a DLL, never mind writing the entire client in C++. Even making a DLL with C++/CLI, you'll still have some headaches getting the interop to work right unless you're already an expert.
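If you do go the DLL route, the C# side might look something like the sketch below. The DLL name and export (Utf8Ucs2Bridge.dll, DecodeUcs2Utf8) are hypothetical; the native side, which you'd have to write, would call std::wstring_convert<std::codecvt_utf8<wchar_t>>::from_bytes and copy the resulting wide characters into the caller's buffer.
using System;
using System.Runtime.InteropServices;

static class Ucs2Utf8Interop
{
    // Hypothetical native export, e.g.
    //   extern "C" int DecodeUcs2Utf8(const char* bytes, int byteCount,
    //                                 wchar_t* buffer, int bufferSize);
    // which would decode via std::codecvt_utf8<wchar_t> and return the number
    // of wide characters written (or required, when buffer is null).
    [DllImport("Utf8Ucs2Bridge.dll", CallingConvention = CallingConvention.Cdecl,
        CharSet = CharSet.Unicode)]
    private static extern int DecodeUcs2Utf8(byte[] bytes, int byteCount,
        char[] buffer, int bufferSize);

    public static string Decode(byte[] bytes)
    {
        // First call asks the native side how many UTF-16 code units it needs.
        int length = DecodeUcs2Utf8(bytes, bytes.Length, null, 0);
        char[] buffer = new char[length];
        DecodeUcs2Utf8(bytes, bytes.Length, buffer, buffer.Length);
        return new string(buffer);
    }
}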
I'm familiar, but hardly an expert, with C++/CLI. I'm much better with C#, so here's some code for the first option:
// Offset used to turn the top bits (code point >> 10) of a supplementary code
// point into its UTF-16 high surrogate: 0xD800 - (0x10000 >> 10) == 0xD7C0.
private const int _khighOffset = 0xD800 - (0x10000 >> 10);

/// <summary>
/// Decodes a nominally UTF8 byte sequence as UTF16. Ignores all data errors
/// except those which prevent coherent interpretation of the input data.
/// Input with invalid-but-decodable UTF8 sequences will be decoded without
/// error, and may lead to invalid UTF16.
/// </summary>
/// <param name="bytes">The UTF8 byte sequence to decode</param>
/// <returns>A string value representing the decoded UTF8</returns>
/// <remarks>
/// This method has not been thoroughly validated. It should be tested
/// carefully with a broad range of inputs (the entire UTF16 code point
/// range would not be unreasonable) before being used in any sort of
/// production environment.
/// </remarks>
private static string DecodeUtf8WithOverlong(byte[] bytes)
{
    List<char> result = new List<char>();
    int continuationCount = 0, continuationAccumulator = 0, highBase = 0;
    char continuationBase = '\0';

    for (int i = 0; i < bytes.Length; i++)
    {
        byte b = bytes[i];

        if (b < 0x80)
        {
            result.Add((char)b);
            continue;
        }

        if (b < 0xC0)
        {
            // Byte values in this range are used only as continuation bytes.
            // If we aren't expecting any continuation bytes, then the input
            // is invalid beyond repair.
            if (continuationCount == 0)
            {
                throw new ArgumentException("invalid encoding");
            }

            // Each continuation byte represents 6 bits of the actual
            // character value
            continuationAccumulator <<= 6;
            continuationAccumulator |= (b - 0x80);

            if (--continuationCount == 0)
            {
                continuationAccumulator += highBase;
                if (continuationAccumulator > 0xffff)
                {
                    // Code point requires more than 16 bits, so split into surrogate pair
                    char highSurrogate = (char)(_khighOffset + (continuationAccumulator >> 10)),
                        lowSurrogate = (char)(0xDC00 + (continuationAccumulator & 0x3FF));

                    result.Add(highSurrogate);
                    result.Add(lowSurrogate);
                }
                else
                {
                    result.Add((char)(continuationBase | continuationAccumulator));
                }

                continuationAccumulator = 0;
                continuationBase = '\0';
                highBase = 0;
            }
            continue;
        }

        if (b < 0xE0)
        {
            continuationCount = 1;
            continuationBase = (char)((b - 0xC0) * 0x0040);
            continue;
        }

        if (b < 0xF0)
        {
            continuationCount = 2;
            continuationBase = (char)(b == 0xE0 ? 0x0800 : (b - 0xE0) * 0x1000);
            continue;
        }

        if (b < 0xF8)
        {
            continuationCount = 3;
            highBase = (b - 0xF0) * 0x00040000;
            continue;
        }

        if (b < 0xFC)
        {
            continuationCount = 4;
            highBase = (b - 0xF8) * 0x01000000;
            continue;
        }

        if (b < 0xFE)
        {
            continuationCount = 5;
            highBase = (b - 0xFC) * 0x40000000;
            continue;
        }

        // byte values of 0xFE and 0xFF are invalid
        throw new ArgumentException("invalid encoding");
    }

    return new string(result.ToArray());
}
I tested it with your globe character and it works fine for that. It also correctly decodes the proper UTF8 for that character (i.e. F0 9F 8C 8E). You'll of course want to test it with a full range of data, if you intend to use that code for decoding all of your UTF8 input.
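For reference, a quick smoke test of the method above (called from within the same class, since it's private) against both byte sequences from the question; this is not a substitute for the broader validation mentioned in the remarks:
byte[] badUTF8 = new byte[] { 0xED, 0xA0, 0xBC, 0xED, 0xBC, 0x8E };
byte[] correctUTF8 = new byte[] { 0xF0, 0x9F, 0x8C, 0x8E };
Console.WriteLine(DecodeUtf8WithOverlong(badUTF8) == "🌎");     // True
Console.WriteLine(DecodeUtf8WithOverlong(correctUTF8) == "🌎"); // True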