
UTF-8 hex to unicode code point (only math)


Take the table below, which lists characters with their hexadecimal encodings as Unicode code points and as UTF-8.
Does anyone know how to convert the UTF-8 bytes to the Unicode code point using only math operations?
For example, take the first row: given 227, 129, 130, how do you get 12354?
Is there a simple way to do this using only math operations?

Unicode code point    UTF-8                         Char
30 42 (12354)         e3 (227) 81 (129) 82 (130)    あ
30 44 (12356)         e3 (227) 81 (129) 84 (132)    い
30 46 (12358)         e3 (227) 81 (129) 86 (134)    う

* Source: https://www.utf8-chartable.de/unicode-utf8-table.pl?start=12288&unicodeinhtml=hex


Solution

  • This video is the perfect source (watch from 6:15), but here is a summary of it and a code sample in Go. The letters mark the bits taken from each UTF-8 byte; hopefully that makes sense. Once you understand the logic, it is easy to apply bitwise operators:

    1-byte (ASCII), char: E
      UTF-8 bytes:        0xxx xxxx
                          0100 0101 or 0x45
      Unicode code point: 0xxx xxxx
                          0100 0101 or U+0045
      Explanation:        no conversion needed, the value is the same in UTF-8 and as a Unicode code point

    2-byte, char: Ê
      UTF-8 bytes:        1. 110x xxxx
                          2. 10yy yyyy
                          1100 0011 1000 1010 or 0xC38A
      Unicode code point: 0xxx xxyy yyyy
                          0000 1100 1010 or U+00CA
      Explanation:        1. Low 5 bits of the 1st byte
                          2. Low 6 bits of the 2nd byte

    3-byte, char: あ
      UTF-8 bytes:        1. 1110 xxxx
                          2. 10yy yyyy
                          3. 10zz zzzz
                          1110 0011 1000 0001 1000 0010 or 0xE38182
      Unicode code point: xxxx yyyy yyzz zzzz
                          0011 0000 0100 0010 or U+3042
      Explanation:        1. Low 4 bits of the 1st byte
                          2. Low 6 bits of the 2nd byte
                          3. Low 6 bits of the 3rd byte

    4-byte, char: 𐄟
      UTF-8 bytes:        1. 1111 0xxx
                          2. 10yy yyyy
                          3. 10zz zzzz
                          4. 10ww wwww
                          1111 0000 1001 0000 1000 0100 1001 1111 or 0xF090_849F
      Unicode code point: 000x xxyy yyyy zzzz zzww wwww
                          0000 0001 0000 0001 0001 1111 or U+1011F
      Explanation:        1. Low 3 bits of the 1st byte
                          2. Low 6 bits of the 2nd byte
                          3. Low 6 bits of the 3rd byte
                          4. Low 6 bits of the 4th byte
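Since the question asks for math operations only, the same bit layout can be written with plain subtraction and multiplication: subtracting the marker value (192, 224 or 240 for a lead byte, 128 for a continuation byte) strips the prefix bits, and multiplying by 64 is the same as shifting left by 6 bits. The sketch below is only an illustration; the function names are made up for this answer, and it assumes the input is already a valid UTF-8 sequence of the stated length.

```go
package main

import "fmt"

// codePoint2 decodes a 2-byte sequence (110xxxxx 10yyyyyy) with arithmetic only.
// Hypothetical helper; assumes valid input.
func codePoint2(b1, b2 int) int {
	return (b1-192)*64 + (b2 - 128)
}

// codePoint3 decodes a 3-byte sequence (1110xxxx 10yyyyyy 10zzzzzz).
// 4096 = 64 * 64, i.e. a shift by 12 bits.
func codePoint3(b1, b2, b3 int) int {
	return (b1-224)*4096 + (b2-128)*64 + (b3 - 128)
}

// codePoint4 decodes a 4-byte sequence (11110xxx 10yyyyyy 10zzzzzz 10wwwwww).
// 262144 = 64 * 64 * 64, i.e. a shift by 18 bits.
func codePoint4(b1, b2, b3, b4 int) int {
	return (b1-240)*262144 + (b2-128)*4096 + (b3-128)*64 + (b4 - 128)
}

func main() {
	// First row of the question's table: 3*4096 + 1*64 + 2 = 12354.
	fmt.Println(codePoint3(227, 129, 130))
}
```

For the first row of the table in the question this gives (227 - 224) * 4096 + (129 - 128) * 64 + (130 - 128) = 12288 + 64 + 2 = 12354.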

    2-byte UTF-8

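    // get decodes a 2-byte UTF-8 sequence into its Unicode code point.
    func get(byte1 byte, byte2 byte) rune {
        int1 := uint16(byte1&0b_0001_1111) << 6 // low 5 bits of the 1st byte
        int2 := uint16(byte2 & 0b_0011_1111)    // low 6 bits of the 2nd byte
        return rune(int1 + int2)
    }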
    

    3-byte UTF-8

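    // get decodes a 3-byte UTF-8 sequence into its Unicode code point.
    func get(byte1 byte, byte2 byte, byte3 byte) rune {
        int1 := uint16(byte1&0b_0000_1111) << 12 // low 4 bits of the 1st byte
        int2 := uint16(byte2&0b_0011_1111) << 6  // low 6 bits of the 2nd byte
        int3 := uint16(byte3 & 0b_0011_1111)     // low 6 bits of the 3rd byte
        return rune(int1 + int2 + int3)
    }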
    

    4-byte UTF-8

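    // get decodes a 4-byte UTF-8 sequence into its Unicode code point.
    func get(byte1 byte, byte2 byte, byte3 byte, byte4 byte) rune {
        int1 := uint(byte1&0b_0000_0111) << 18 // low 3 bits of the 1st byte
        int2 := uint(byte2&0b_0011_1111) << 12 // low 6 bits of the 2nd byte
        int3 := uint(byte3&0b_0011_1111) << 6  // low 6 bits of the 3rd byte
        int4 := uint(byte4 & 0b_0011_1111)     // low 6 bits of the 4th byte
        return rune(int1 + int2 + int3 + int4)
    }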
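
The three cases above can also be folded into one function that reads the sequence length from the lead byte. This is a minimal sketch, not a full decoder: `decode` is a name invented here, it assumes well-formed input, and it does not reject overlong encodings, surrogates or bad continuation bytes. The standard library's `utf8.DecodeRuneInString` is used only to cross-check the result.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decode returns the code point of the first UTF-8 sequence in b and the
// number of bytes it occupies. Hypothetical helper: it returns
// (utf8.RuneError, 0) for an unknown lead byte or a slice that is too short,
// and performs no further validation.
func decode(b []byte) (rune, int) {
	if len(b) == 0 {
		return utf8.RuneError, 0
	}
	var n int
	var r rune
	switch {
	case b[0]&0x80 == 0x00: // 0xxxxxxx: ASCII, value is the byte itself
		return rune(b[0]), 1
	case b[0]&0xE0 == 0xC0: // 110xxxxx: keep the low 5 bits
		n, r = 2, rune(b[0]&0x1F)
	case b[0]&0xF0 == 0xE0: // 1110xxxx: keep the low 4 bits
		n, r = 3, rune(b[0]&0x0F)
	case b[0]&0xF8 == 0xF0: // 11110xxx: keep the low 3 bits
		n, r = 4, rune(b[0]&0x07)
	default:
		return utf8.RuneError, 0
	}
	if len(b) < n {
		return utf8.RuneError, 0
	}
	// Each continuation byte (10xxxxxx) contributes its low 6 bits.
	for _, c := range b[1:n] {
		r = r<<6 | rune(c&0x3F)
	}
	return r, n
}

func main() {
	for _, s := range []string{"E", "Ê", "あ", "𐄟"} {
		got, _ := decode([]byte(s))
		want, _ := utf8.DecodeRuneInString(s)
		fmt.Printf("U+%04X matches standard library: %v\n", got, got == want)
	}
}
```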