I don't quite understand the principles behind UTF encodings and BOM.
What is the point of having BOM in UTF-16 and UTF-32 if computers already know how to compose multibyte data types (for example, integers with the size of 4 bytes) into one variable? Why do we need to specify it explicitly for these encodings then?
And why don't we need to specify it for UTF-8? The Unicode standard says that it's "byte oriented", but even then we need to know whether a given byte is the first byte of an encoded code point or not. Or is that specified in the first / last bits of every character?
UTF-16 uses code units that are two bytes wide; let's call those bytes B0|B1.
Let's say we have the letter 'a', which is logically the number 0x0061. Unfortunately, different computer architectures store this number in memory in different ways. On x86 the least significant byte is stored first (at the lower memory address), so 'a' is stored as 61|00. On big-endian PowerPC the most significant byte comes first, so it is stored as 00|61. These two byte orders are called little endian and big endian, respectively.
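To make the two layouts concrete, here is a minimal Python sketch that serializes the same 16-bit value in both byte orders. The names `CODE_POINT`, `little` and `big` are just illustrative; `struct.pack` and the `utf-16-le` / `utf-16-be` codecs are from the standard library.

```python
import struct
import sys

CODE_POINT = ord("a")  # 0x0061

# Pack the same 16-bit number with an explicit byte order.
little = struct.pack("<H", CODE_POINT)  # least significant byte first
big = struct.pack(">H", CODE_POINT)     # most significant byte first

print(little.hex("|"))  # 61|00
print(big.hex("|"))     # 00|61

# The explicit-endian UTF-16 codecs produce exactly these bytes.
assert "a".encode("utf-16-le") == little
assert "a".encode("utf-16-be") == big

# Native order of the machine running this script ("little" on x86).
print(sys.byteorder)
```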
To speed up string processing, libraries generally store two-byte characters in native order (big endian or little endian). Swapping bytes on every access would be too expensive.
Now imagine that someone on PowerPC writes a string to a file; the library will write the bytes 00|61. Then someone on x86 wants to read these bytes, but do they mean 00|61 or 61|00? We can put a special sequence at the beginning of the string so that anyone can tell which byte order was used to save it, and process it correctly. That sequence is the BOM: the code point U+FEFF, which appears as FE|FF in big endian and FF|FE in little endian. (Converting a string between byte orders is a costly operation, but most of the time an x86 string will be read on x86, and a PowerPC string on PowerPC machines.)
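A reader can use that leading sequence roughly like this. This is only a sketch: `decode_utf16` is a hypothetical helper, not a library function, and the big-endian fallback for input without a BOM is an assumption of the sketch (real applications often know the byte order from some other source).

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, using a leading BOM (if any) to pick the byte order."""
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return data[2:].decode("utf-16-be")
    # No BOM: assumption for this sketch, treat the data as big endian.
    return data.decode("utf-16-be")


# The same letter written by a little-endian and a big-endian writer.
written_on_x86 = codecs.BOM_UTF16_LE + "a".encode("utf-16-le")
written_on_ppc = codecs.BOM_UTF16_BE + "a".encode("utf-16-be")

print(decode_utf16(written_on_x86))  # a
print(decode_utf16(written_on_ppc))  # a
```

Python's built-in `utf-16` codec does the same BOM handling on decode, so in practice `data.decode("utf-16")` is usually enough.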
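With UTF-8 it is a different story. UTF-8 is a sequence of single bytes in one fixed order, and it encodes the length of each character in the bit pattern of that character's first byte; continuation bytes always start with the bits 10, so a reader can tell whether a byte begins a character or not. The UTF-8 encoding is well described on Wikipedia. Generally speaking, it was designed to avoid problems with endianness.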
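The lead-byte patterns can be checked with a few bit masks. The function below is an illustrative sketch (`utf8_sequence_length` is a made-up name), and it only classifies the first byte; it does not validate the rest of the sequence or reject the few lead-byte values that valid UTF-8 never uses.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 character occupies, judged by its first byte."""
    if first_byte & 0b1000_0000 == 0b0000_0000:  # 0xxxxxxx: ASCII, 1 byte
        return 1
    if first_byte & 0b1110_0000 == 0b1100_0000:  # 110xxxxx: 2 bytes
        return 2
    if first_byte & 0b1111_0000 == 0b1110_0000:  # 1110xxxx: 3 bytes
        return 3
    if first_byte & 0b1111_1000 == 0b1111_0000:  # 11110xxx: 4 bytes
        return 4
    # 10xxxxxx is a continuation byte, so it cannot start a character.
    raise ValueError(f"not a UTF-8 lead byte: {first_byte:#04x}")


for ch in ["a", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, encoded.hex("|"), utf8_sequence_length(encoded[0]))
```

Because every byte identifies itself as either a lead byte or a continuation byte, there is no multi-byte unit whose order could differ between machines, which is why UTF-8 does not need a BOM to signal byte order.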