Why is there a trailing 0x00 byte after a BSON string (not cstring/ename)?

Obviously, for a BSON cstring the trailing byte is what marks the end of the string, so it is written as: (byte*) "\x00". Cstrings are used for regex patterns, regex options and ename, which are not long and are not skipped over in iterations, so a length prefix is not necessary. But then comes...

bson string is written as: int32 (byte*) "\x00"

with specification as follows: The int32 is the number bytes in the (byte*) + 1 (for the trailing '\x00'). The (byte*) is zero or more UTF-8 encoded characters.
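To make the layout concrete, here is a minimal sketch of that encoding in Python. The function name `encode_bson_string` is made up for illustration and is not part of any driver; the only facts it relies on are the layout quoted above and BSON's little-endian integers.

```python
import struct

def encode_bson_string(text: str) -> bytes:
    """Encode text as a BSON string: int32 length, UTF-8 bytes, trailing 0x00."""
    payload = text.encode("utf-8")
    # The int32 is little-endian and counts the payload plus the trailing 0x00.
    return struct.pack("<i", len(payload) + 1) + payload + b"\x00"

encoded = encode_bson_string("hi")
print(encoded.hex(" "))  # 03 00 00 00 68 69 00
```

The prefix is 3 here, not 2: two bytes for "hi" plus one for the terminator.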

But why use a trailing zero byte at all? If we already have the length of the UTF-8 encoded string, that is sufficient for working with the byte data, and the 0x00 just adds an unneeded byte. Am I missing something?

Solution

The reason for having both a length prefix and a null terminator is twofold: compatibility with existing C-style strings, and performance.

For performance, MongoDB needs to be able to jump to a specific field in a document quickly, without reading every byte of the BSON. This matters most when the field sits near the end of a large (say 16 MB) document. Because the length is the first thing encoded in a string value, the parser can skip that many bytes and land directly on the next field. Without it, the parser would have to scan the string byte by byte until it found the terminator.
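As a rough illustration of the difference, the sketch below skips over a string value using the length prefix and, for contrast, skips a cstring by scanning for the terminator. Both helper names are hypothetical, and this is not how the server is actually implemented; it only shows the arithmetic.

```python
import struct

def skip_bson_string(buf: bytes, offset: int) -> int:
    """Return the offset just past a BSON string value starting at offset."""
    (length,) = struct.unpack_from("<i", buf, offset)
    # 4 bytes for the int32 itself, then `length` bytes (payload + trailing 0x00).
    return offset + 4 + length

def skip_cstring(buf: bytes, offset: int) -> int:
    """Return the offset just past a cstring; this has to search for the 0x00."""
    return buf.index(b"\x00", offset) + 1

# Two BSON string values back to back: "hello" followed by "x".
buf = b"\x06\x00\x00\x00hello\x00" + b"\x02\x00\x00\x00x\x00"
second = skip_bson_string(buf, 0)
print(second)                         # 10
print(skip_bson_string(buf, second))  # 16
```

The first helper does a fixed amount of work no matter how long the string is, while the second grows with the length of the string. Element names are still cstrings, so a reader has to scan those, but they are normally short compared with the values.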

For compatibility, MongoDB is written in C++, where C-style strings are null-terminated. The terminator could be dropped to save one byte, since the length is already encoded, but getting the string out of BSON in a form usable by C and C++ string functions would then require tacking the null back on. That would need a specialized string-handling routine whose only advantage is saving a single byte.
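One way to see the compatibility point without writing C++ is to mimic it with Python's `ctypes`. In this sketch (an illustration only, with made-up variable names), the encoded bytes sit in a raw buffer and the value is read as a C string straight from its position inside that buffer, with nothing appended first.

```python
import ctypes
import struct

payload = "hello".encode("utf-8")
bson_string = struct.pack("<i", len(payload) + 1) + payload + b"\x00"

# Copy the bytes into a raw C buffer, as code holding a received document might.
raw = ctypes.create_string_buffer(bson_string, len(bson_string))

# Point 4 bytes past the int32 prefix and read a C string (stops at the 0x00).
value_address = ctypes.addressof(raw) + 4
c_view = ctypes.string_at(value_address)
print(c_view)  # b'hello'
```

Without the trailing 0x00 in the encoding, the same read would run past the end of the value, so the bytes would first have to be copied somewhere with room for a terminator.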

Overall, it was decided that "wasting" a single byte is an acceptable tradeoff.