
Converting a C# string to a hex representation of its UTF-8-encoded characters


I have a string in C# that could contain any Unicode characters, and I want to convert it into a hex representation of its UTF-8 encoding, with a space between each Unicode character. For instance, the string "$£€𐍈" should be converted to the output string "24 C2A3 E282AC F0908D88". I can't see how to do this, though. Because strings in C# are UTF-16, I can't just write foreach (char entry in myString) { ... }: a Unicode code point can be represented by either 1 or 2 chars, as is the case for the last character in my example above.

I feel like I need to end up with a byte[][] that represents the list of characters, each held as the array of UTF-8-encoded bytes for that character. I could then convert those bytes to their hex representation, with spaces between the Unicode characters.

How could I achieve the desired output?


Solution

  • You can minimize your intermediate memory allocations by making use of the Rune struct, and by using stackalloc'ed intermediate buffers, like so:

    public static partial class TextHelper
    {
        public static string ToUtf8HexValues(this string s)
        {
            Span<byte> runeByteSpan = stackalloc byte[4]; // rune.EncodeToUtf8() can be up to 4 bytes as shown in https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Text/Rune.cs#L1060
            Span<char> charSpan = stackalloc char[2]; // Greater than or equal to the max number of hex chars in a byte value, which is 2.
            var sb = new StringBuilder();
            foreach (var rune in s.EnumerateRunes())
            {
                if (sb.Length > 0)
                    sb.Append(' ');
                for (int i = 0, length = rune.EncodeToUtf8(runeByteSpan); i < length; i++)
                    if (runeByteSpan[i].TryFormat(charSpan, out var n, "X2")) // "X2" pads to two hex digits, so a byte such as 0x09 is written as "09" rather than "9".
                        sb.Append(charSpan.Slice(0, n));
            }
            return sb.ToString();
        }
    }
    

    Notes:

    Demo fiddle #1 here.
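
    For comparison, here is a minimal sketch of the byte[][] shape described in the question: one UTF-8 byte array per Rune, joined as hex. It is not the method above. It allocates an array and a string per rune, so it trades memory for brevity. The names Utf8HexSketch, ToUtf8PerRune and ToUtf8HexValuesSimple are made up for illustration, and the use of Convert.ToHexString assumes .NET 5 or later.

    ```csharp
    using System;
    using System.Linq;
    using System.Text;

    // Hypothetical helper class, for illustration only.
    public static class Utf8HexSketch
    {
        // Returns one UTF-8 byte array per Unicode scalar value (Rune) in s.
        public static byte[][] ToUtf8PerRune(string s) =>
            s.EnumerateRunes()
             .Select(rune =>
             {
                 // Utf8SequenceLength is between 1 and 4 for any Rune.
                 var buffer = new byte[rune.Utf8SequenceLength];
                 rune.EncodeToUtf8(buffer);
                 return buffer;
             })
             .ToArray();

        // Formats each byte array as upper-case hex and separates the groups with spaces.
        // Assumes .NET 5+ for Convert.ToHexString.
        public static string ToUtf8HexValuesSimple(string s) =>
            string.Join(" ", ToUtf8PerRune(s).Select(bytes => Convert.ToHexString(bytes)));
    }
    ```

    Under these assumptions, Utf8HexSketch.ToUtf8HexValuesSimple("$£€𐍈") returns "24 C2A3 E282AC F0908D88", which matches the output requested in the question.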

    If you only want a space to be inserted between grapheme clusters (sequences of code points, such as a base letter followed by combining marks, that display as a single character), then (starting with .NET 6) you can use StringInfo.GetNextTextElementLength(ReadOnlySpan<Char>) to enumerate through a string in chunks that keep combining characters together with their base character:

    public static partial class TextHelper
    {
        public static IEnumerable<ReadOnlyMemory<char>> TextElements(this string s) => (s ?? "").AsMemory().TextElements();
    
        public static IEnumerable<ReadOnlyMemory<char>> TextElements(this ReadOnlyMemory<char> s)
        {
            for (int index = 0, length = StringInfo.GetNextTextElementLength(s.Span); 
                 length > 0; 
                 index += length, length = StringInfo.GetNextTextElementLength(s.Span.Slice(index)))
                yield return s.Slice(index, length);
        }   
        
        public static string ToUtf8GraphemeHexValues(this string s)
        {
            Span<byte> runeByteSpan = stackalloc byte[4]; // rune.EncodeToUtf8() can be up to 4 bytes as shown in https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Text/Rune.cs#L1060
            Span<char> charSpan = stackalloc char[2]; // Greater than or equal to the max number of hex chars in a byte value, which is 2.
            var sb = new StringBuilder();
            foreach (var chunk in s.TextElements())
            {
                if (sb.Length > 0)
                    sb.Append(' ');
                foreach (var rune in chunk.Span.EnumerateRunes())
                    for (int i = 0, length = rune.EncodeToUtf8(runeByteSpan); i < length; i++)
                        if (runeByteSpan[i].TryFormat(charSpan, out var n, "X2")) // "X2" pads to two hex digits, so a byte such as 0x09 is written as "09" rather than "9".
                            sb.Append(charSpan.Slice(0, n));
            }
            return sb.ToString();
        }
    }
    

    Notes:

    Demo fiddle #2 here.
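
    To see how grouping by text element differs from grouping by rune, the following sketch may help. It deliberately swaps in the older StringInfo.GetTextElementEnumerator API in place of GetNextTextElementLength, and Encoding.UTF8.GetBytes in place of per-rune encoding, so it is simpler but allocates a string and a byte array per text element. The names GraphemeHexSketch and ToUtf8GraphemeHexValuesSimple are hypothetical, and Convert.ToHexString assumes .NET 5 or later. Exactly which code points are grouped into one text element depends on the runtime version.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using System.Text;

    // Hypothetical helper class, for illustration only.
    public static class GraphemeHexSketch
    {
        public static string ToUtf8GraphemeHexValuesSimple(string s)
        {
            var groups = new List<string>();
            var enumerator = StringInfo.GetTextElementEnumerator(s);
            while (enumerator.MoveNext())
            {
                // Each text element is a substring holding one or more code points.
                string element = enumerator.GetTextElement();
                // Note: Encoding.UTF8 replaces unpaired surrogates with U+FFFD (EF BF BD).
                groups.Add(Convert.ToHexString(Encoding.UTF8.GetBytes(element)));
            }
            return string.Join(" ", groups);
        }
    }
    ```

    For the input "e\u0301" (the letter e followed by U+0301 COMBINING ACUTE ACCENT), this sketch returns "65CC81" as a single group, whereas splitting per rune gives "65 CC81".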