I have a string in C# which could contain any set of Unicode characters, and I want to convert it into a hex representation of the UTF8 encoding of that string, with a space between each Unicode character. For instance, the string "$£€𐍈" would be converted to an output string of "24 C2A3 E282AC F0908D88".
I can't see how to do this, though. Because strings in C# are UTF16, I can't just say foreach (char entry in myString) { ... } because a Unicode glyph can be represented by either 1 or 2 chars, as is the case for the last glyph in my example above.
I feel like I need to end up with a byte[][] which represents the list of characters, each represented as a list of the UTF8-encoded bytes that determine the character. I could then convert those bytes to their hex representation, with spaces between the Unicode characters.
How could I achieve the desired output?
You can minimize your intermediate memory allocations by making use of the Rune struct and stackalloc'ed intermediate buffers, like so:
using System;
using System.Text;

public static partial class TextHelper
{
    public static string ToUtf8HexValues(this string s)
    {
        // rune.EncodeToUtf8() can write up to 4 bytes, as shown in
        // https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Text/Rune.cs#L1060
        Span<byte> runeByteSpan = stackalloc byte[4];
        // Greater than or equal to the max number of hex chars in a byte value, which is 2.
        Span<char> charSpan = stackalloc char[2];
        var sb = new StringBuilder();
        foreach (var rune in s.EnumerateRunes())
        {
            // Separate consecutive runes with a single space.
            if (sb.Length > 0)
                sb.Append(' ');
            for (int i = 0, length = rune.EncodeToUtf8(runeByteSpan); i < length; i++)
                // "X2" pads to two hex digits, so a byte such as 0x0A is written as "0A" rather than "A".
                if (runeByteSpan[i].TryFormat(charSpan, out var n, "X2"))
                    sb.Append(charSpan.Slice(0, n));
        }
        return sb.ToString();
    }
}
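As a cross-check for the method above, here is a minimal sketch of a simpler version that allocates a temporary string and byte[] per rune. The names Utf8HexDemo and ToUtf8HexValuesSimple are hypothetical and used only for illustration. It formats each byte with "X2" so that every byte is always two hex digits.

```csharp
using System;
using System.Text;

public static class Utf8HexDemo
{
    // Hypothetical reference version: simple but allocation-heavy.
    public static string ToUtf8HexValuesSimple(string s)
    {
        var sb = new StringBuilder();
        foreach (Rune rune in (s ?? "").EnumerateRunes())
        {
            // One space between the hex groups of consecutive runes.
            if (sb.Length > 0)
                sb.Append(' ');
            // rune.ToString() and GetBytes() each allocate; fine for a cross-check.
            foreach (byte b in Encoding.UTF8.GetBytes(rune.ToString()))
                sb.Append(b.ToString("X2"));
        }
        return sb.ToString();
    }
}
```

For the string from the question, Utf8HexDemo.ToUtf8HexValuesSimple("$£€𐍈") returns "24 C2A3 E282AC F0908D88".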
Notes:
Rune was introduced in .NET Core 3.0. This struct:
"Represents a Unicode scalar value ([ U+0000..U+D7FF ], inclusive; or [ U+E000..U+10FFFF ], inclusive)."
Byte.TryFormat(Span<Char>, Int32, ReadOnlySpan<Char>, IFormatProvider) was introduced in .NET Core 2.1 and allows a Byte to be formatted to a fixed-length Span<char> without a string allocation.
Demo fiddle #1 here.
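The two APIs in the notes can be tried in isolation. The sketch below uses hypothetical helper names (RuneFormatDemo, FormatByte, CountRunes). It shows that a surrogate pair counts as one Rune, and that the "X" format writes a single digit for byte values below 0x10 while "X2" always writes two.

```csharp
using System;
using System.Text;

public static class RuneFormatDemo
{
    // Formats one byte as hex into a stack buffer via Byte.TryFormat.
    // The returned string is allocated only so the result can be inspected.
    public static string FormatByte(byte value, string format)
    {
        Span<char> buffer = stackalloc char[2];
        return value.TryFormat(buffer, out int written, format)
            ? buffer.Slice(0, written).ToString()
            : string.Empty;
    }

    // Counts Unicode scalar values (runes) rather than UTF-16 code units.
    public static int CountRunes(string s)
    {
        int count = 0;
        foreach (Rune rune in s.EnumerateRunes())
            count++;
        return count;
    }
}
```

Here FormatByte(0x0A, "X") gives "A" and FormatByte(0x0A, "X2") gives "0A", while "𐍈".Length is 2 but CountRunes("𐍈") is 1.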
If you only want a space to be inserted between grapheme clusters such as Ĥ, then (starting with .NET 6) you can use StringInfo.GetNextTextElementLength(ReadOnlySpan<Char>) to enumerate through a string in chunks that cluster combining characters together:
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static partial class TextHelper
{
    public static IEnumerable<ReadOnlyMemory<char>> TextElements(this string s) => (s ?? "").AsMemory().TextElements();

    // Yields one slice per text element (grapheme cluster) without allocating substrings.
    public static IEnumerable<ReadOnlyMemory<char>> TextElements(this ReadOnlyMemory<char> s)
    {
        for (int index = 0, length = StringInfo.GetNextTextElementLength(s.Span);
            length > 0;
            index += length, length = StringInfo.GetNextTextElementLength(s.Span.Slice(index)))
            yield return s.Slice(index, length);
    }

    public static string ToUtf8GraphemeHexValues(this string s)
    {
        // rune.EncodeToUtf8() can write up to 4 bytes, as shown in
        // https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Text/Rune.cs#L1060
        Span<byte> runeByteSpan = stackalloc byte[4];
        // Greater than or equal to the max number of hex chars in a byte value, which is 2.
        Span<char> charSpan = stackalloc char[2];
        var sb = new StringBuilder();
        foreach (var chunk in s.TextElements())
        {
            // Separate consecutive grapheme clusters with a single space.
            if (sb.Length > 0)
                sb.Append(' ');
            foreach (var rune in chunk.Span.EnumerateRunes())
                for (int i = 0, length = rune.EncodeToUtf8(runeByteSpan); i < length; i++)
                    // "X2" pads to two hex digits, so a byte such as 0x0A is written as "0A" rather than "A".
                    if (runeByteSpan[i].TryFormat(charSpan, out var n, "X2"))
                        sb.Append(charSpan.Slice(0, n));
        }
        return sb.ToString();
    }
}
Notes:
You could also use StringInfo.GetTextElementEnumerator(String); however, using this method will result in allocating a string for each grapheme cluster.
Demo fiddle #2 here.
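For comparison, a minimal sketch of the StringInfo.GetTextElementEnumerator(String) route might look like the following. The class and method names are hypothetical, and the per-cluster string returned by GetTextElement() is exactly the allocation the note warns about.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class TextElementEnumeratorDemo
{
    // Hypothetical allocating variant: one string and one byte[] per grapheme cluster.
    public static string ToUtf8GraphemeHexValuesSimple(string s)
    {
        var sb = new StringBuilder();
        TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(s ?? "");
        while (elements.MoveNext())
        {
            // One space between the hex groups of consecutive grapheme clusters.
            if (sb.Length > 0)
                sb.Append(' ');
            // GetTextElement() allocates a new string for the current cluster.
            foreach (byte b in Encoding.UTF8.GetBytes(elements.GetTextElement()))
                sb.Append(b.ToString("X2"));
        }
        return sb.ToString();
    }
}
```

With the input "e\u0301x" (an "e" followed by a combining acute accent, then "x"), this returns "65CC81 78": the base letter and its combining mark stay in one group.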