unicode utf-8 character-encoding byte-order-mark

UTF-8 multibyte & BOM


I read this great tutorial:
http://www.joelonsoftware.com/articles/Unicode.html

But I don't understand how UTF-8 deals with big-endian versus little-endian machines. For a single byte it is fine, but how does it work for multi-byte characters?

Can someone explain this better?


Solution

  • Here is a link that explains UTF-8 in depth. http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

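    At the heart of it, UTF-16 is oriented around 16-bit integers (short integers), whereas UTF-8 is oriented around bytes. Architectures differ in how they order the bytes of a multi-byte data type (big-endian or little-endian), so UTF-16 can be serialized either way. On all architectures I am aware of there is no endianness at the nibble (semi-octet) level: every byte is simply a sequence of 8 bits. A multi-byte UTF-8 character is defined as a sequence of individual bytes in a fixed order (the lead byte first, then the continuation bytes), not as one larger integer, so there is nothing for the machine to reorder. Therefore UTF-8 has no endianness, and it does not need a byte order mark to signal byte order; a BOM at the start of UTF-8 data acts only as an encoding signature.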
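    To make the distinction concrete, here is a minimal Python sketch. The value 0x1234 is an arbitrary number chosen for illustration and is not taken from the linked article. It shows that a 16-bit integer can be laid out in two byte orders, while a sequence of single bytes has only one layout.

```python
import struct
import sys

value = 0x1234  # arbitrary 16-bit integer, chosen only for illustration

big = struct.pack(">H", value)     # most significant byte first
little = struct.pack("<H", value)  # least significant byte first
native = struct.pack("=H", value)  # whatever order this machine uses

print("this machine is", sys.byteorder)
print("big-endian:   ", big.hex())
print("little-endian:", little.hex())
print("native:       ", native.hex())

# A byte sequence has no such ambiguity: each element is already a single
# byte, and the order of the elements is part of the data itself.
sequence = bytes([0x12, 0x34])
print("byte sequence:", sequence.hex())
```

    UTF-16 code units behave like `value` here, while UTF-8 output behaves like `sequence`.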

    The Japanese character あ is a good example. It is U+3042 (binary=0011 0000 : 0100 0010).

    • UTF-16BE: 30, 42 = 0011 0000 : 0100 0010
    • UTF-16LE: 42, 30 = 0100 0010 : 0011 0000
    • UTF-8: e3, 81, 82 = 1110 0011 : 10 0000 01 : 10 00 0010
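    The UTF-8 row can be reproduced by hand. The following is a rough sketch, not a production encoder: the function name `utf8_encode` is made up for this example, and it skips the check that rejects surrogate code points. It shows that the bytes come from shifting and masking the code point, so their order is fixed by the encoding rules rather than by the machine.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 (hypothetical helper, illustration only).

    Simplification: surrogates (U+D800..U+DFFF) are not rejected here,
    although a conforming encoder must reject them.
    """
    if code_point < 0:
        raise ValueError("code point out of range")
    if code_point < 0x80:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:                  # 11110xxx + 3 continuation bytes
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of range")


ch = "\u3042"  # the character あ from the example above

# Compare the hand-rolled result with Python's built-in codecs.
results = {name: ch.encode(name) for name in ("utf-16-be", "utf-16-le", "utf-8")}
results["by hand"] = utf8_encode(ord(ch))

for name, data in results.items():
    print(name, " ".join(f"{b:02x}" for b in data))
```

    The two UTF-16 results differ only in byte order, while the hand-computed bytes match the built-in UTF-8 codec on any machine.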

    Here is some information on Unicode あ