Tags: php, unicode, utf-8, character-encoding, utf-16

What is meant by 'Highest Bit' or 'Highest Bits' in a byte?


I'm a PHP Developer by profession.

Consider the following text regarding the UTF-8 encoding standard:

UTF-8 is a variable-length encoding. If a character can be represented using a single byte, UTF-8 will encode it with a single byte. If it requires two bytes, it will use two bytes, and so on. It has elaborate ways of using the highest bits in a byte to signal how many bytes a character consists of. This can save space, but may also waste space if these signal bits need to be used often.

Also, consider the following UTF-8 and UTF-16 encoding examples:

あ UTF-8 Encoded byte string is 11100011 10000001 10000010

あ UTF-16 Encoded byte string is 00110000 01000010
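
(For reference, here is a small PHP sketch that reproduces these byte strings. It assumes the mbstring extension is available and that the source file itself is saved as UTF-8; the helper name toBits is just for illustration.)

    <?php
    // Print every byte of a string as 8 binary digits (illustrative helper).
    function toBits(string $bytes): string {
        $out = [];
        foreach (str_split($bytes) as $byte) {               // str_split() splits by byte
            $out[] = str_pad(decbin(ord($byte)), 8, '0', STR_PAD_LEFT);
        }
        return implode(' ', $out);
    }

    $char = "あ";                                                        // stored as UTF-8 in this file
    echo toBits($char), "\n";                                            // 11100011 10000001 10000010
    echo toBits(mb_convert_encoding($char, 'UTF-16BE', 'UTF-8')), "\n";  // 00110000 01000010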

Could someone please explain the meaning of the term highest bits (or highest bit) in a byte, in the context of the UTF-8 encoding standard and PHP?

Also, please explain how these highest bits (or highest bit) in a byte are used to signal how many bytes a character consists of.

How can these highest bits (or highest bit) in a byte save space, but also waste space if the signal bits need to be used often?

Please give your answer and explanations with the help of the encoding examples I've provided in the question.


Solution

  • This answer just answers your (small) questions here, but I really suggest you read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) to get the broader picture. It's 15 years old, but the basics don't change, and it gives a good explanation of the background, the technicalities, and the history behind it all. That certainly helps in explaining issues you will encounter in practice when doing web development with Unicode, and it will help you set up good test cases, so your software doesn't suddenly break down when a French or Japanese person starts using it. After all, if you start using Unicode, you have to do it right all the way, from the database to the charset headers.

    That said...

    High bits

    Highest bits are the bits, typically written on the left side, that represent the highest part of the value. Just like when you write 1857824, the 1 is the highest digit (representing a million). For binary it's the same, except each digit can only be 0 or 1.
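
    As a small illustration, you can read the highest bit of a byte in PHP with a bit mask, since that bit is worth 128 (2^7):

        <?php
        $byte = 0b11100011;                      // first byte of あ in UTF-8

        // Mask off the leftmost (highest) bit and shift it down to 0 or 1.
        $highestBit = ($byte & 0b10000000) >> 7;

        echo decbin($byte), "\n";                // 11100011
        echo $highestBit, "\n";                  // 1 -> the highest bit is set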

    Signal bits

    In UTF-8, instead of using all the bits of a byte for the value of the character (which would allow 256 different characters per byte), a smaller number of bits is used for the value, and some bits signal that the next byte contains more information about the same character. Those signal bits are on the 'high' side (at the front).
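
    A rough sketch of how those lead-byte patterns can be inspected in PHP (the bit patterns are the standard UTF-8 ones; the function name utf8SequenceLength is just illustrative):

        <?php
        // Look at the high bits of a UTF-8 lead byte to see how many bytes the character uses.
        function utf8SequenceLength(int $byte): int {
            if (($byte & 0b10000000) === 0b00000000) return 1; // 0xxxxxxx: single byte (ASCII)
            if (($byte & 0b11100000) === 0b11000000) return 2; // 110xxxxx: start of a 2-byte character
            if (($byte & 0b11110000) === 0b11100000) return 3; // 1110xxxx: start of a 3-byte character
            if (($byte & 0b11111000) === 0b11110000) return 4; // 11110xxx: start of a 4-byte character
            return 0;                                          // 10xxxxxx: continuation byte, not a lead byte
        }

        echo utf8SequenceLength(0b11100011), "\n"; // 3 -> あ starts a 3-byte sequence
        echo utf8SequenceLength(ord('a')), "\n";   // 1 -> plain ASCII letter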

    Fitting characters in 2 or 3 bytes

    If you have just English text, every character will still fit in a single byte in UTF-8, and the signal bit will indicate that no further bytes follow. If you mix in, now and then, a Latin character with diacritics, some characters will take 2 bytes, but many will still take one, so it's still more space-efficient than UTF-16, which always uses a multiple of 2 bytes.
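
    You can see those byte counts directly in PHP, again assuming the source file is saved as UTF-8 (strlen() counts bytes, mb_strlen() counts characters):

        <?php
        echo strlen("a"), "\n";                  // 1 byte  (ASCII)
        echo strlen("é"), "\n";                  // 2 bytes (Latin letter with a diacritic)
        echo strlen("あ"), "\n";                 // 3 bytes (Japanese hiragana)
        echo strlen("aéあ"), "\n";               // 6 bytes in UTF-8...
        echo mb_strlen("aéあ", 'UTF-8'), "\n";   // ...but only 3 characters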

    This also means that UTF-16 spends fewer bits on signaling whether more units follow (for a character like あ, none at all), so more of the space is left for the character data itself. That leads to the interesting effect you see with your Japanese あ: it fits in 2 bytes in UTF-16, while UTF-8 needs 3 bytes, because so many signal bits are used up that there is no room left to fit Japanese into 2 bytes alongside all the other character sets.
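
    Comparing the two encodings directly in PHP (again assuming mbstring; the helper name bytesIn is illustrative) shows this trade-off:

        <?php
        // Count the raw bytes a piece of UTF-8 text needs in a given target encoding.
        function bytesIn(string $utf8Text, string $encoding): int {
            return strlen(mb_convert_encoding($utf8Text, $encoding, 'UTF-8'));
        }

        echo bytesIn("あ", 'UTF-8'), "\n";        // 3 bytes
        echo bytesIn("あ", 'UTF-16BE'), "\n";     // 2 bytes
        echo bytesIn("hello", 'UTF-8'), "\n";     // 5 bytes
        echo bytesIn("hello", 'UTF-16BE'), "\n";  // 10 bytes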

    This means that if you really worry about space, you might consider storing and sending predominantly Japanese text in UTF-16, while storing and sending predominantly Latin text (including English) in UTF-8. In reality, I wouldn't worry too much about that; save yourself A LOT of trouble by choosing one and sticking to it.