Tags: assembly, utf-8, numbers, byte, 8-bit

How to convert a decimal larger than 255 into two 8-bit values (2 bytes)


OK, I know how to convert a decimal to 8-bit binary. For example, the char "A" is 65 in decimal, and it's very simple to convert that into binary. But what if the decimal is larger than 255? For example, the Arabic char "م" is 1605 in decimal, which is 11001000101 in binary. When I convert it on any website, it shows 11011001 10000101. I want to know how 11001000101 becomes 11011001 10000101.


Solution

  • Your Arabic char "م" has code point 1605 in decimal. This is 0645h in hexadecimal and it is 0000'0110'0100'0101b in binary.

    The utf-8 encoding will represent all characters with a code point in the range U+0000 to U+007F with 1 byte, using the following template:

    0_______
     ^
     | 7 bits
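
    For the 1-byte range nothing has to be rearranged; as a minimal sketch (the destination label somewhere is only a placeholder, the same one used at the end of this answer), the code point itself is the utf-8 byte:

    mov al, 'A'          ; 'A' = 41h = 01000001b - already fits the 0_______ template
    mov [somewhere], al  ; store the single utf-8 byte unchanged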
    

    The utf-8 encoding will represent all characters with a code point in the range U+0080 to U+07FF with 2 bytes. Your Arabic char "م", at U+0645, falls in this range.

    When dealing with 2 bytes, the template becomes:

    110_____ 10______
       ^       ^
       |       | 6 bits
       | 5 bits
    

    In this template we fill in the lowest 11 bits of your code point's binary representation, 11001'000101b; since 1605 needs exactly 11 bits, that is all of them:

    110_____ 10______
       ^       ^
       | 11001 | 000101
    

    This produces the binary 110'11001'10'000101b, i.e. the two bytes 11011001b (0D9h) and 10000101b (085h), which matches the output the website gave you.

    Below is the x86 assembly version of the conversion for code points in the range U+0080 to U+07FF (decimal 128 to 2047):

                                           <------ AX ------->
    mov ax, 1605        ; Your example:    0000 0110 0100 0101
                                            /                / 
                                           /                /  Shift left the whole 16 bits, twice
    shl ax, 2                              0001 1001 0001 0100
                                                     \      \
                                                      \      \ Shift right the lowest 8 bits, twice
    shr al, 2                              0001 1001 0000 0101
                                           |||       ||
                                           |||       ||        Put in the template bits
    or  ax, 1100000010000000b              1101 1001 1000 0101
                                           <- AH --> <-- AL -> 
    

    Now the AH register contains the first byte of the utf-8 encoding and the AL register contains the second byte of the utf-8 encoding.
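
    For comparison, here is a hedged alternative sketch that builds each byte separately with a shift, a mask, and an OR instead of the in-place trick above (the use of DX and BX here is my own choice, not part of the original answer):

    mov dx, 1605         ; the same code point 0645h
    mov bx, dx
    shr bx, 6            ; BX = the top 5 of the 11 code point bits
    or  bl, 11000000b    ; BL = 110_____ template -> first utf-8 byte (0D9h)
    and dl, 00111111b    ; keep only the lowest 6 code point bits
    or  dl, 10000000b    ; DL = 10______ template -> second utf-8 byte (085h)

    Both routes end with the same two bytes, 0D9h and 085h.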

    Because x86 is a little-endian architecture, where the lowest byte is stored first in memory, an xchg al, ah instruction will fix the order of the bytes right before moving the result to memory with mov [somewhere], ax, as sketched below.
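
    A minimal sketch of that final step, reusing the answer's own xchg and mov (somewhere stands for any 2-byte buffer):

    xchg al, ah          ; AL = 0D9h (first utf-8 byte), AH = 085h (second byte)
    mov  [somewhere], ax ; the little-endian store writes AL first, then AH,
                         ; so memory holds 0D9h 085h - valid utf-8 for "م"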