When writing a string to a binary file using C#, the length (in bytes) is automatically prepended to the output. According to the MSDN documentation this prefix is an unsigned integer, yet in their example it is a single byte: a single UTF-8 character is written as three bytes, 1 size byte and 2 bytes for the character. This is fine for strings up to length 255, and matches the behaviour I've observed.
However, if your string is longer than 255 bytes, the size of this length prefix grows as necessary. As a simple example, consider a 1024-character string built as:
string header = "ABCDEFGHIJKLMNOP";
for (int ii = 0; ii < 63; ii++)
{
    header += "ABCDEFGHIJKLMNOP";
}
fileObject.Write(header);
results in 2 bytes prepending the string. Creating a 2^17-character string results in a somewhat maddening 3-byte prefix.
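To make the growth easy to see, here is a small self-contained illustration of my own (not the actual fileObject code above) that writes to a MemoryStream and dumps the first bytes:

using System;
using System.IO;
using System.Text;

class PrefixDemo
{
    static void Main()
    {
        string header = new string('A', 1024);   // 1024 single-byte characters

        using (var ms = new MemoryStream())
        {
            using (var writer = new BinaryWriter(ms, Encoding.UTF8))
            {
                writer.Write(header);
            }

            byte[] bytes = ms.ToArray();
            // The first two bytes are the length prefix; the remaining 1024 are the string.
            Console.WriteLine("{0:X2} {1:X2}", bytes[0], bytes[1]);
            Console.WriteLine(bytes.Length);   // 1026 = 2 prefix bytes + 1024 payload bytes
        }
    }
}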
The question, therefore, is: when reading, how do I know how many bytes to read in order to get the size of what follows? I wouldn't necessarily know the header size a priori. Ultimately, can I force the Write(string) method to always use a consistent prefix size (say, 2 bytes)?
A possible workaround is to write my own Write(string) method, but I would like to avoid that for obvious reasons (similar questions here and here accept this as an answer). Another, more palatable workaround is to have the reader look for a specific character that marks the start of the ASCII string information (maybe an unprintable character?), but that is not infallible. A final workaround (that I can think of) would be to force the string to fall within the range of sizes covered by a particular number of size bytes; again, that is non-ideal.
While forcing the size of the prefix to be consistent is the easiest option, I have control over the reader, so any clever reader-side solutions are also welcome.
BinaryWriter and BinaryReader aren't the only way of writing binary data; simply: they provide a convention that is shared between that specific reader and writer. No, you can't tell them to use another convention - unless of course you subclass both of them and override the ReadString and Write(string) methods completely.
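If you did go down the subclassing route, a rough sketch might look like the following (the class names, the fixed 2-byte prefix, and UTF-8 are all my own choices here, not anything mandated by the framework):

using System;
using System.IO;
using System.Text;

// Writer/reader pair that replaces the default length prefix with a fixed
// 2-byte (ushort) prefix, written little-endian by BinaryWriter.
class FixedPrefixWriter : BinaryWriter
{
    public FixedPrefixWriter(Stream output) : base(output, Encoding.UTF8) { }

    public override void Write(string value)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(value);
        if (bytes.Length > ushort.MaxValue)
            throw new ArgumentException("String is too long for a 2-byte length prefix.");
        Write((ushort)bytes.Length);   // fixed 2-byte prefix
        Write(bytes);                  // raw UTF-8 payload
    }
}

class FixedPrefixReader : BinaryReader
{
    public FixedPrefixReader(Stream input) : base(input, Encoding.UTF8) { }

    public override string ReadString()
    {
        ushort length = ReadUInt16();        // always consume exactly 2 bytes
        byte[] bytes = ReadBytes(length);    // then the payload
        return Encoding.UTF8.GetString(bytes);
    }
}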
If you want to use a different convention, then simply: don't use BinaryReader and BinaryWriter. It is pretty easy to talk to a Stream directly using any text Encoding you want, to get hold of the bytes and the byte count; then you can use whatever convention you want. If you only ever need to write strings up to 65k then sure: use a fixed 2 bytes (unsigned short). You'll also need to decide which byte comes first, of course (the "endianness").
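A minimal sketch of that approach, assuming UTF-8 and a little-endian 2-byte prefix (both of those choices are mine; pick whatever suits your format):

using System;
using System.IO;
using System.Text;

static class FixedPrefixStrings
{
    public static void WriteString(Stream stream, string value)
    {
        byte[] payload = Encoding.UTF8.GetBytes(value);
        if (payload.Length > ushort.MaxValue)
            throw new ArgumentException("String exceeds 65535 bytes.");

        ushort length = (ushort)payload.Length;
        stream.WriteByte((byte)(length & 0xFF));   // low byte first: little-endian
        stream.WriteByte((byte)(length >> 8));     // then the high byte
        stream.Write(payload, 0, payload.Length);
    }

    public static string ReadString(Stream stream)
    {
        int low = stream.ReadByte();
        int high = stream.ReadByte();
        if (low < 0 || high < 0) throw new EndOfStreamException();

        int length = low | (high << 8);
        byte[] payload = new byte[length];
        int read = 0;
        while (read < length)   // Stream.Read may return fewer bytes than asked for
        {
            int got = stream.Read(payload, read, length - read);
            if (got <= 0) throw new EndOfStreamException();
            read += got;
        }
        return Encoding.UTF8.GetString(payload);
    }
}

The reader then always consumes exactly two bytes for the length, no matter how long the string is.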
As for the size of the prefix: it is essentially using:
int byteCount = this._encoding.GetByteCount(value);
this.Write7BitEncodedInt(byteCount);
with:
protected void Write7BitEncodedInt(int value)
{
    uint num = (uint) value;
    // Emit 7 bits at a time, least significant group first,
    // setting the high bit to flag that another byte follows.
    while (num >= 0x80)
    {
        this.Write((byte) (num | 0x80));
        num = num >> 7;
    }
    // Final byte has the high bit clear: no continuation.
    this.Write((byte) num);
}
This type of length encoding is pretty common - it is the same idea as the "varint" that "protobuf" uses, for example (base-128, least significant group first, retaining bit order within the 7-bit groups, 8th bit as continuation).
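Because that 8th bit tells the reader whether another length byte follows, a reader never needs to know the prefix size in advance - it just keeps reading bytes until it sees one with the high bit clear. A minimal decode sketch, if you want to parse such a prefix from a raw Stream yourself (recent .NET versions also expose a public Read7BitEncodedInt on BinaryReader; the helper below is just illustrative):

using System;
using System.IO;

static class VarintReader
{
    // Accumulate 7 bits per byte, least significant group first,
    // until a byte with the continuation (0x80) bit cleared is read.
    public static int Read7BitEncodedInt(Stream stream)
    {
        int result = 0;
        int shift = 0;
        while (true)
        {
            int b = stream.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
            if (shift > 35) throw new FormatException("Too many bytes for a 7-bit encoded Int32.");
        }
    }
}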