Tags: c++, delphi, unicode, utf-8, utf-16

Looking for the description of the algorithm to convert UTF8 to UTF16


I have 3 bytes representing a Unicode character encoded in UTF-8. For example, I have E2 82 AC (UTF-8), which represents the Unicode character € (U+20AC). Is there an algorithm to make this conversion? I know there is the Windows API MultiByteToWideChar, but I would like to know whether there is a simple mathematical relation between E2 82 AC and U+20AC. In other words, is the mapping from UTF-8 to UTF-16 a simple mathematical function, or is it a hardcoded map?


Solution

  • Converting a valid UTF-8 byte sequence directly to UTF-16 is doable with a little mathematical know-how.

    Validating a UTF-8 byte sequence is fairly trivial. See The Algorithm to Validate an UTF-8 String. Be sure to check for things like:

    • the first byte matches one of the patterns below, and that (byte and $C0) = $80 is true for each subsequent byte in the sequence.
    • the byte sequence is not more than 4 bytes long.
    • the byte sequence is not an "overlong" encoding of the codepoint (ie, it must use the least number of bytes required to encode the codepoint).
    • the codepoint does not have a restricted value: it must not be one of the reserved UTF-16 surrogates U+D800..U+DFFF, and it must not be greater than U+10FFFF. These codepoints should NEVER appear in any valid Unicode text.
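The checks listed above can be sketched in C++ roughly as follows. This is only an illustration, assuming a C++14 compiler and a caller that passes exactly one candidate sequence; the name `isValidUtf8Sequence` and the `minCp` bookkeeping are hypothetical, not taken from any library.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns true if the `len` bytes at `s` form exactly
// one valid UTF-8 sequence, applying the checks listed above.
constexpr bool isValidUtf8Sequence(const std::uint8_t* s, std::size_t len) {
    if (len == 0 || len > 4) return false;           // never more than 4 bytes

    std::uint32_t cp = 0;        // codepoint being assembled
    std::uint32_t minCp = 0;     // smallest codepoint that needs this many bytes
    std::size_t expected = 0;    // length announced by the first byte

    if ((s[0] & 0x80) == 0x00)      { expected = 1; cp = s[0];        minCp = 0x0; }
    else if ((s[0] & 0xE0) == 0xC0) { expected = 2; cp = s[0] & 0x1F; minCp = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { expected = 3; cp = s[0] & 0x0F; minCp = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { expected = 4; cp = s[0] & 0x07; minCp = 0x10000; }
    else return false;                               // not a valid first byte

    if (expected != len) return false;               // truncated or too long

    for (std::size_t i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80) return false;     // not a continuation byte
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    if (cp < minCp) return false;                    // "overlong" encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // reserved UTF-16 surrogate
    if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
    return true;
}
```

With this sketch, E2 82 AC is accepted, while E0 82 AC (an overlong encoding of U+00AC) and ED A0 80 (the surrogate U+D800) are rejected.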

    The first byte in a UTF-8 sequence tells you how many bytes are in the sequence:

    (byte1 and $80) = $00: 1 byte  // 0xxxxxxx
    (byte1 and $E0) = $C0: 2 bytes // 110xxxxx
    (byte1 and $F0) = $E0: 3 bytes // 1110xxxx
    (byte1 and $F8) = $F0: 4 bytes // 11110xxx
    anything else: error
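The same table, written as a small C++ function for illustration. The name `utf8SequenceLength` and the convention of returning 0 for "error" are assumptions made for this sketch.

```cpp
#include <cstdint>

// Hypothetical helper: returns the sequence length (1..4) announced by the
// first byte, or 0 if the byte cannot start a UTF-8 sequence.
constexpr int utf8SequenceLength(std::uint8_t byte1) {
    return (byte1 & 0x80) == 0x00 ? 1   // 0xxxxxxx
         : (byte1 & 0xE0) == 0xC0 ? 2   // 110xxxxx
         : (byte1 & 0xF0) == 0xE0 ? 3   // 1110xxxx
         : (byte1 & 0xF8) == 0xF0 ? 4   // 11110xxx
         : 0;                           // anything else: error
}
```

For example, `utf8SequenceLength(0xE2)` yields 3, while a continuation byte such as 0x82 yields 0. Note that this only classifies the first byte; the other validation checks still apply.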
    

    There are very simple formulas for converting UTF-8 1-byte, 2-byte, and 3-byte sequences to UTF-16. They all represent Unicode codepoints below U+10000, which UTF-16 represents as-is in a single 16-bit codeunit. No surrogates are needed, just some bit twiddling, eg:

    1 byte:

    UTF16 = UInt16(byte1 and $7F)
    

    2 bytes:

    UTF16 = (UInt16(byte1 and $1F) shl 6)
            or UInt16(byte2 and $3F)
    

    3 bytes:

    UTF16 = (UInt16(byte1 and $0F) shl 12)
            or (UInt16(byte2 and $3F) shl 6)
            or UInt16(byte3 and $3F)
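For readers working in C++, the three formulas above translate almost one-to-one. The function names below are made up for this sketch, and the functions assume the bytes have already been validated.

```cpp
#include <cstdint>

// Hypothetical helpers mirroring the 1-, 2- and 3-byte formulas above.
// Each returns the single UTF-16 codeunit for an already validated sequence.
constexpr char16_t utf16From1Byte(std::uint8_t b1) {
    return static_cast<char16_t>(b1 & 0x7F);
}

constexpr char16_t utf16From2Bytes(std::uint8_t b1, std::uint8_t b2) {
    return static_cast<char16_t>(((b1 & 0x1F) << 6) | (b2 & 0x3F));
}

constexpr char16_t utf16From3Bytes(std::uint8_t b1, std::uint8_t b2, std::uint8_t b3) {
    return static_cast<char16_t>(((b1 & 0x0F) << 12)
                               | ((b2 & 0x3F) << 6)
                               |  (b3 & 0x3F));
}
```

As a quick check, `utf16From2Bytes(0xC3, 0xA9)` gives 0x00E9 (é) and `utf16From3Bytes(0xE2, 0x82, 0xAC)` gives 0x20AC (€).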
    

    Converting a UTF-8 4-byte sequence to UTF-16, on the other hand, is slightly more involved. It represents a Unicode codepoint that is U+10000 or higher, which UTF-16 must encode as a surrogate pair, and that requires some additional math to calculate, eg:

    4 bytes:

    CP = (UInt32(byte1 and $07) shl 18)
         or (UInt32(byte2 and $3F) shl 12)
         or (UInt32(byte3 and $3F) shl 6)
         or UInt32(byte4 and $3F)
    CP = CP - $10000
    highSurrogate = $D800 + UInt16((CP shr 10) and $3FF)
    lowSurrogate = $DC00 + UInt16(CP and $3FF)
    UTF16 = highSurrogate, lowSurrogate
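A C++ rendering of the 4-byte case might look like the following. The `SurrogatePair` struct and the function name are assumptions for this sketch, and the input is assumed to be an already validated 4-byte sequence.

```cpp
#include <cstdint>

// Hypothetical result type: the two UTF-16 codeunits of a surrogate pair.
struct SurrogatePair {
    char16_t high;
    char16_t low;
};

// Mirrors the 4-byte formula above for an already validated sequence.
constexpr SurrogatePair utf16From4Bytes(std::uint8_t b1, std::uint8_t b2,
                                        std::uint8_t b3, std::uint8_t b4) {
    // Assemble the 21-bit codepoint, then subtract $10000 to get 20 bits.
    std::uint32_t cp = (static_cast<std::uint32_t>(b1 & 0x07) << 18)
                     | (static_cast<std::uint32_t>(b2 & 0x3F) << 12)
                     | (static_cast<std::uint32_t>(b3 & 0x3F) << 6)
                     |  static_cast<std::uint32_t>(b4 & 0x3F);
    cp -= 0x10000;
    // Upper 10 bits go into the high surrogate, lower 10 bits into the low one.
    return SurrogatePair{
        static_cast<char16_t>(0xD800 + ((cp >> 10) & 0x3FF)),
        static_cast<char16_t>(0xDC00 + (cp & 0x3FF))
    };
}
```

For instance, F0 9F 98 80 (U+1F600) comes out as the pair $D83D, $DE00.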
    

    Now, with that said, let's look at your example: E2 82 AC

    The first byte $E2 has a bit pattern of 11100010, which matches the start pattern of a 3-byte sequence (ie ($E2 and $F0) = $E0 is true).

    The second byte $82 has a bit pattern of 10000010, which matches the pattern of a continuation byte (ie ($82 and $C0) = $80 is true).

    The third byte $AC has a bit pattern of 10101100, which matches the pattern of a continuation byte (ie ($AC and $C0) = $80 is true).

    So, this is a well-formed UTF-8 3-byte sequence.

    Plugging those byte values into the 3-byte formula, you get:

    UTF16 = (UInt16($E2 and $0F) shl 12)
            or (UInt16($82 and $3F) shl 6)
            or UInt16($AC and $3F)
    
          = (UInt16($02) shl 12)
            or (UInt16($02) shl 6)
            or UInt16($2C)
    
          = $2000
            or $80
            or $2C
    
          = $20AC
    

    Further validating, U+20AC lies in the range U+0800..U+FFFF, so its most compact UTF-8 form does require 3 bytes, which proves that the byte sequence is not an "overlong" encoding. And U+20AC is not a restricted codepoint.

    So, this is a valid UTF-8 3-byte sequence.

    And indeed, Unicode codepoint U+20AC is encoded in UTF-16 as a single codeunit $20AC.
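Putting the whole walkthrough together, one possible C++ sketch validates a single UTF-8 sequence and emits its UTF-16 codeunits in one pass. The `Utf16Result` struct, the function name, and the "count of 0 means invalid" convention are all assumptions for illustration; a C++14 compiler is assumed.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result type: up to two UTF-16 codeunits.
struct Utf16Result {
    char16_t units[2];
    int count;   // 0 = invalid input, 1 = single codeunit, 2 = surrogate pair
};

// Validates the `len` bytes at `s` as one UTF-8 sequence and converts it.
constexpr Utf16Result utf8SequenceToUtf16(const std::uint8_t* s, std::size_t len) {
    const Utf16Result invalid = {{0, 0}, 0};
    if (len == 0) return invalid;

    std::uint32_t cp = 0;        // codepoint being assembled
    std::uint32_t minCp = 0;     // smallest codepoint allowed for this length
    std::size_t expected = 0;    // length announced by the first byte

    if ((s[0] & 0x80) == 0x00)      { expected = 1; cp = s[0];        minCp = 0x0; }
    else if ((s[0] & 0xE0) == 0xC0) { expected = 2; cp = s[0] & 0x1F; minCp = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { expected = 3; cp = s[0] & 0x0F; minCp = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { expected = 4; cp = s[0] & 0x07; minCp = 0x10000; }
    else return invalid;
    if (expected != len) return invalid;

    for (std::size_t i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80) return invalid;   // not a continuation byte
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    // Reject overlong encodings, surrogates, and values above U+10FFFF.
    if (cp < minCp || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) return invalid;

    if (cp < 0x10000) {
        return Utf16Result{{static_cast<char16_t>(cp), 0}, 1};
    }
    cp -= 0x10000;
    return Utf16Result{{static_cast<char16_t>(0xD800 + ((cp >> 10) & 0x3FF)),
                        static_cast<char16_t>(0xDC00 + (cp & 0x3FF))}, 2};
}
```

Called on the bytes E2 82 AC with a length of 3, this returns a count of 1 and the single codeunit $20AC, matching the result worked out by hand above.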