
convert byte array to strings split by NUL character


I am sorry if this is a dumb question, but I can't really figure this out, and I bet it is much simpler than I think.

I have a byte[] array that contains several Unicode strings. Each char clearly takes 2 bytes, each string is terminated by a 00 00 byte pair, and a second 00 00 pair marks the end of it all.

When I try to use UnicodeEncoding.Unicode.GetString(myBuffer) I do get the first string, but once the delimiter bytes are reached it starts to return garbage.

Right now I am parsing byte by byte and then concatenating things, but I am sure there has to be a better way to do this.

I was wondering if I should try to find the "position" of the delimiter bytes and then limit the GetString method to that length? But if so, how do you find the position of 2 specific bytes in a byte array?

The example byte array looks like this:

Hex View
 
00000000  73 00 74 00 72 00 31 00  00 00 73 00 74 00 72 00  s.t.r.1...s.t.r.
00000010  32 00 00 00 73 00 74 00  72 00 33 00 00 00 00 00  2...s.t.r.3.....

Solution

  • So your buffer is valid little-endian UTF-16 data. Those "double 00 bytes" are just the NUL character, or \0.

    Encoding.Unicode.GetString(myBuffer) will actually correctly decode the whole buffer, but it'll have embedded NUL characters in it delimiting each sub string. Which is fine, because \0 is just like any character. This isn't C.
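To see this on the sample from the question, here is a minimal sketch. The `sample` array is just the hex dump above typed out by hand, and the variable names are mine:

```csharp
using System;
using System.Text;

// The 32 bytes from the hex dump in the question, typed out by hand.
byte[] sample =
{
    0x73, 0x00, 0x74, 0x00, 0x72, 0x00, 0x31, 0x00, 0x00, 0x00, 0x73, 0x00, 0x74, 0x00, 0x72, 0x00,
    0x32, 0x00, 0x00, 0x00, 0x73, 0x00, 0x74, 0x00, 0x72, 0x00, 0x33, 0x00, 0x00, 0x00, 0x00, 0x00,
};

string decoded = Encoding.Unicode.GetString(sample);

// Every pair of bytes became one char, NULs included.
Console.WriteLine(decoded.Length);                    // 16
Console.WriteLine(decoded.IndexOf('\0'));             // 4, right after "str1"
Console.WriteLine(decoded == "str1\0str2\0str3\0\0"); // True
```

Nothing is lost in the decoded string. The NULs are ordinary chars that still have to be split out.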

    The sample code below will use Console.WriteLine to signify "use the substring", but feel free to substitute with what is appropriate.

    First approach: decode the whole thing

    If you split by \0 after decoding, you can get all the substrings, removing empty entries to get rid of those final NULs:

    var decoded = Encoding.Unicode.GetString(myBuffer);
    foreach(var str in decoded.Split('\0', StringSplitOptions.RemoveEmptyEntries))
        Console.WriteLine(str);
    

    Alternatively, you can search for the first NUL if you want:

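    var index = decoded.IndexOf('\0');
    // IndexOf returns -1 when there is no NUL, in which case the whole string is the first entry
    var firstStr = index >= 0 ? decoded.Substring(0, index) : decoded;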
    

    Second approach: split, then decode

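    If you don't want to do it all in one go because you have to process a lot of data at once, then you could just find the next NUL code unit (a 00 00 pair that starts on an even offset) and decode up to it. Searching the raw bytes for 00 00 is not enough: in 31 00 00 00 the first match starts at the odd offset 7, in the middle of the "1".

    var units = myBuffer.AsSpan();
    while (!units.IsEmpty)
    {
        // search in 16-bit units so a match can only start on a character boundary
        var charIndex = MemoryMarshal.Cast<byte, char>(units).IndexOf('\0');
        if (charIndex == -1)
            break;
    
        var index = charIndex * sizeof(char);
        if (index > 0)
        {
            var str = Encoding.Unicode.GetString(units[..index]);
            Console.WriteLine(str);
        }
        // skip past the two bytes of the NUL itself
        units = units[(index + sizeof(char))..];
    }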
    

    Alternatively, cast to a span of char (for example with MemoryMarshal.Cast<byte, char>), which lets you call ToString() on the span to get a string and skip the decoding step. This assumes a little-endian machine and that the data is all valid text, since ultimately all you're doing is skipping the validation. Up to you.
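The cast can be sketched like this. `SplitWithoutDecoding` is a hypothetical helper name of mine, not a library method, and the sketch assumes a little-endian machine and well-formed UTF-16 input:

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of any library. Assumes a little-endian machine
// and well-formed UTF-16 input; nothing is validated here.
static List<string> SplitWithoutDecoding(ReadOnlySpan<byte> buffer)
{
    var result = new List<string>();
    // Reinterpret the bytes as 16-bit chars; a trailing odd byte is ignored.
    ReadOnlySpan<char> chars = MemoryMarshal.Cast<byte, char>(buffer);
    while (!chars.IsEmpty)
    {
        int index = chars.IndexOf('\0');
        if (index == -1)
        {
            result.Add(chars.ToString()); // no terminator: keep the remainder
            break;
        }
        if (index > 0)
            result.Add(chars[..index].ToString());
        chars = chars[(index + 1)..];
    }
    return result;
}

byte[] sample = { 0x61, 0x00, 0x62, 0x00, 0x00, 0x00, 0x63, 0x00, 0x00, 0x00, 0x00, 0x00 };
foreach (var s in SplitWithoutDecoding(sample))
    Console.WriteLine(s); // on a little-endian machine: "ab", then "c"
```

Because no decoder runs, an unpaired surrogate in the input ends up in the resulting string as-is instead of being replaced with U+FFFD.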

    Third approach: reading from a stream, character by character

    But then if you have that much data on hand you probably should be reading from a stream, using a StreamReader to decode on the go:

    using var stream = new MemoryStream(myBuffer, false);
    using var reader = new StreamReader(stream, Encoding.Unicode);
        
    var current = new StringBuilder();
    int c;
    while((c = reader.Read()) != -1)
    {
        if (c == 0)
        {
            if(current.Length > 0)
            {
                var str = current.ToString();
                Console.WriteLine(str);
                current.Clear();
            }
        }
        else
        {
            current.Append((char)c);
        }
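    }
    
    // flush a final string that was not followed by a NUL
    if (current.Length > 0)
        Console.WriteLine(current.ToString());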
    

    Fourth approach: reading from a stream in batches

    An optimisation to the code above would be to call Read with a block of chars and then piece the data back yourself:

    using var stream = new MemoryStream(myBuffer, false);
    using var reader = new StreamReader(stream, Encoding.Unicode);
        
    Span<char> batch = stackalloc char[4096];
        
    var current = new StringBuilder();
    int read;
    while ((read = reader.Read(batch)) > 0)
    {
        var left = batch[..read];
        while (!left.IsEmpty)
        {
            var index = left.IndexOf('\0');
            if (index == -1)
            {
                current.Append(left);
                break;
            }
            else
            {
                current.Append(left[..index]);
                
                // we have a string, collect it (skip the empty ones produced by consecutive NULs)
                if (current.Length > 0)
                {
                    var str = current.ToString();
                    Console.WriteLine(str);
                    current.Clear();
                }
                    
                left = left[(index + 1)..];
            }
        }
    }
        
    if(current.Length > 0)
    {
        // don't forget leftovers
        var str = current.ToString();
        Console.WriteLine(str);
    }
    

    That should get you decent results.

    Now, those two final approaches use a StringBuilder to build the substrings, but you don't have to do it that way: you can send those characters elsewhere (maybe you're writing them to a file).
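As a rough sketch of that last idea, the code below copies the characters straight to a TextWriter, one string per line, without ever building the substrings. `CopyAsLines` is a hypothetical helper of mine, and using '\n' as the output separator is an assumption:

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: copies NUL-separated UTF-16LE text from a stream to a writer,
// one string per line, without building intermediate strings.
static void CopyAsLines(Stream input, TextWriter output)
{
    using var reader = new StreamReader(input, Encoding.Unicode);
    var batch = new char[4096];
    bool inString = false; // true while we are in the middle of a non-empty string
    int read;
    while ((read = reader.Read(batch, 0, batch.Length)) > 0)
    {
        for (int i = 0; i < read; i++)
        {
            if (batch[i] != '\0')
            {
                output.Write(batch[i]);
                inString = true;
            }
            else if (inString)
            {
                output.Write('\n'); // end of a non-empty string
                inString = false;
            }
        }
    }
    if (inString)
        output.Write('\n'); // terminate a final string that had no trailing NUL
}

byte[] sample = Encoding.Unicode.GetBytes("str1\0str2\0str3\0\0");
using var sink = new StringWriter();
CopyAsLines(new MemoryStream(sample, false), sink);
Console.Write(sink.ToString()); // str1, str2 and str3 on separate lines
```

A StringWriter stands in for the destination here. For a file you would pass a StreamWriter instead.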