I am sorry if this is a dumb question, but I can't figure this out, and I bet it is much simpler than I think.
I have a byte[] array which contains several Unicode strings. Each char clearly takes 2 bytes, and each string is terminated by a 00 00 byte pair, with an extra 00 00 pair marking the end of it all.
When I try to use UnicodeEncoding.Unicode.GetString(myBuffer) I do get the first string, but once the delimiter bytes are reached I start getting garbage all around.
Right now I am parsing byte by byte and then concatenating things, but I am sure there has to be a better way to do this.
I was wondering if I should try to find the "position" of the delimiter bytes and then limit the GetString method to that length. But if so, how do you find the position of 2 specific bytes in a byte array?
The example byte array looks like this:
Hex View
00000000 73 00 74 00 72 00 31 00 00 00 73 00 74 00 72 00 s.t.r.1...s.t.r.
00000010 32 00 00 00 73 00 74 00 72 00 33 00 00 00 00 00 2...s.t.r.3.....
So your buffer is valid little-endian UTF-16 data. Those "double 00 bytes" are just the NUL character, or \0.
Encoding.Unicode.GetString(myBuffer) will actually decode the whole buffer correctly, but the result will have embedded NUL characters delimiting each substring. Which is fine, because \0 is just like any other character in a .NET string. This isn't C.
The sample code below uses Console.WriteLine to signify "use the substring", but feel free to substitute whatever is appropriate.
If you split by \0 after decoding, you can get all the substrings, removing empty entries to get rid of those final NULs:
var decoded = Encoding.Unicode.GetString(myBuffer);
foreach (var str in decoded.Split('\0', StringSplitOptions.RemoveEmptyEntries))
    Console.WriteLine(str);
Alternatively, you can search for the first NUL if you want:
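var index = decoded.IndexOf('\0');
// IndexOf returns -1 when there is no NUL, so fall back to the whole string
var firstStr = index >= 0 ? decoded.Substring(0, index) : decoded;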
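If you don't want to do it all in one go because you have a lot of data to process at once, you could instead find the next NUL code unit (the 00 00 pair) and decode only up to there. Search on 16-bit units rather than raw bytes: a byte-wise search for 00 00 can match across two characters (in the sample, the high byte of "1" followed by the low byte of the NUL), which would cut the string one byte short.
// View the buffer as 16-bit code units so the search stays aligned
// to character boundaries. Zero is zero in either byte order.
var units = MemoryMarshal.Cast<byte, ushort>(myBuffer.AsSpan());
while (!units.IsEmpty)
{
    var index = units.IndexOf((ushort)0);
    if (index == -1)
        break;
    if (index > 0)
    {
        // decode only the bytes that belong to this substring
        var str = Encoding.Unicode.GetString(MemoryMarshal.AsBytes(units[..index]));
        Console.WriteLine(str);
    }
    // skip past the NUL code unit
    units = units[(index + 1)..];
}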
Alternatively, cast to a span of char, which would allow you to use ToString() on the span to get a string, skipping the decoding step. This assumes the data is all valid text, as ultimately all you're doing is skipping the validation. Up to you.
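A minimal sketch of that char-span variant, assuming a little-endian machine (so the in-memory char layout matches UTF-16LE), an even byte count, and data you already trust to be valid UTF-16. SplitAsChars and sample are hypothetical names used only for illustration; sample reproduces the bytes from the question's hex dump.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of the original answer: splits the buffer on
// NUL by reinterpreting the bytes as chars. No decoder runs, so nothing is
// validated. Assumes little-endian hardware and an even number of bytes.
static List<string> SplitAsChars(byte[] buffer)
{
    var result = new List<string>();
    ReadOnlySpan<char> chars = MemoryMarshal.Cast<byte, char>(buffer.AsSpan());
    while (!chars.IsEmpty)
    {
        var index = chars.IndexOf('\0');
        if (index == -1)
            break; // no terminator left; trailing data is ignored in this sketch
        if (index > 0)
            result.Add(chars[..index].ToString()); // copies the chars as-is
        chars = chars[(index + 1)..];
    }
    return result;
}

// The buffer from the question: "str1", "str2", "str3", each NUL-terminated,
// followed by one extra NUL.
byte[] sample =
{
    0x73, 0x00, 0x74, 0x00, 0x72, 0x00, 0x31, 0x00, 0x00, 0x00,
    0x73, 0x00, 0x74, 0x00, 0x72, 0x00, 0x32, 0x00, 0x00, 0x00,
    0x73, 0x00, 0x74, 0x00, 0x72, 0x00, 0x33, 0x00, 0x00, 0x00,
    0x00, 0x00,
};

foreach (var s in SplitAsChars(sample))
    Console.WriteLine(s);
```

The trade-off is the one described above: ToString() on the span copies whatever code units are there, so unpaired surrogates or other malformed input would end up in the resulting strings unchanged.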
But then if you have that much data on hand you probably should be reading from a stream, using a StreamReader to decode on the go:
using var stream = new MemoryStream(myBuffer, false);
using var reader = new StreamReader(stream, Encoding.Unicode);
var current = new StringBuilder();
int c;
while ((c = reader.Read()) != -1)
{
    if (c == 0)
    {
        if (current.Length > 0)
        {
            var str = current.ToString();
            Console.WriteLine(str);
            current.Clear();
        }
    }
    else
    {
        current.Append((char)c);
    }
}
An optimisation to the code above would be to call Read with a block of chars and then piece the data back together yourself:
using var stream = new MemoryStream(myBuffer, false);
using var reader = new StreamReader(stream, Encoding.Unicode);
Span<char> batch = stackalloc char[4096];
var current = new StringBuilder();
int read;
while ((read = reader.Read(batch)) > 0)
{
    var left = batch[..read];
    while (!left.IsEmpty)
    {
        var index = left.IndexOf('\0');
        if (index == -1)
        {
            // no terminator in this batch, carry the chars over to the next one
            current.Append(left);
            break;
        }
        else
        {
            current.Append(left[..index]);
            // skip empty entries (e.g. the final NULs), as the per-char loop does
            if (current.Length > 0)
            {
                // we have a string, collect it
                var str = current.ToString();
                Console.WriteLine(str);
                current.Clear();
            }
            left = left[(index + 1)..];
        }
    }
}
if (current.Length > 0)
{
    // don't forget leftovers
    var str = current.ToString();
    Console.WriteLine(str);
}
That should get you decent results.
Now, those two final approaches use a StringBuilder to build the substrings, but you don't have to do it that way. You can send those characters elsewhere (maybe you're writing them to a file).
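As a rough sketch of "sending the characters elsewhere", the batch loop can write straight to a TextWriter instead of buffering each substring. CopySubstrings, sample, and sink are hypothetical names, and ending each substring with a newline is an assumption about the output format, not something the question asks for. A StringWriter stands in for the destination here; a StreamWriter over a file would be used the same way.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: decodes UTF-16LE from `input` and writes each
// NUL-terminated substring to `output`, followed by a newline.
// Note that disposing the StreamReader also disposes `input`.
static void CopySubstrings(Stream input, TextWriter output)
{
    using var reader = new StreamReader(input, Encoding.Unicode);
    var batch = new char[4096];
    var inSubstring = false; // true once the current substring has characters
    int read;
    while ((read = reader.Read(batch, 0, batch.Length)) > 0)
    {
        ReadOnlySpan<char> left = batch.AsSpan(0, read);
        while (!left.IsEmpty)
        {
            var index = left.IndexOf('\0');
            var chunk = index == -1 ? left : left[..index];
            if (!chunk.IsEmpty)
            {
                output.Write(chunk);
                inSubstring = true;
            }
            if (index == -1)
                break; // substring continues in the next batch
            if (inSubstring)
            {
                output.WriteLine(); // terminator reached, end this substring
                inSubstring = false;
            }
            left = left[(index + 1)..];
        }
    }
    if (inSubstring)
        output.WriteLine(); // unterminated leftovers
}

// The buffer from the question: "str1", "str2", "str3", then one extra NUL.
byte[] sample =
{
    0x73, 0x00, 0x74, 0x00, 0x72, 0x00, 0x31, 0x00, 0x00, 0x00,
    0x73, 0x00, 0x74, 0x00, 0x72, 0x00, 0x32, 0x00, 0x00, 0x00,
    0x73, 0x00, 0x74, 0x00, 0x72, 0x00, 0x33, 0x00, 0x00, 0x00,
    0x00, 0x00,
};

var sink = new StringWriter { NewLine = "\n" };
CopySubstrings(new MemoryStream(sample, false), sink);
Console.Write(sink.ToString());
```

Because nothing is accumulated per substring, memory use stays bounded by the batch size no matter how long an individual string is.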