javascript html node.js utf-8 computer-science

UTF-8 vs UTF-16 and UTF-32 conversion confusion

I'm kinda confused about the conversion of Unicode characters into hexadecimal values.

I'm using this website to get the hexadecimal values for characters: https://www.branah.com/unicode-converter

If I put "A" and convert then I get something like:

0041 --> UTF-16
00000041 --> UTF-32
41 --> UTF-8
00065 --> Decimal Value

The output above makes sense because all of these hexadecimal values convert to 65.

Now, if I put "Я" (without quotes) and convert it, I get values like:

042f --> UTF-16
0000042f --> UTF-32
d0af --> UTF-8
01071 --> Decimal Value

This output doesn't make sense to me because not all these hexadecimal values convert back to 1071.

If you take d0af and try to convert it back to a decimal value, you get 53423.

This is really confusing for me. I've been searching online for answers about this conversion, but so far I haven't been able to find a good one.

So, I'm wondering if anyone here can help (that would mean a lot). Thanks in advance.

You can also see the link below for an example of this conversion in binary (and can you explain why the UTF-8 binary value is different in the last example?).

http://kunststube.net/encoding/

Solution

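UTF-8 uses a variable-length encoding (it can use 1, 2, 3, or 4 bytes to store a single character). The UTF-8 hex value you see is therefore the encoded byte sequence, not the code point itself. For "Я", UTF-16 and UTF-32 happen to store the code point 042f directly, which is why only the UTF-8 value looks different.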

In this case:

d0af = 11010000 10101111

The 110 at the start of the first byte tells us to expect 2 bytes when decoding (see the byte 1 column of the schematic). When decoding, we use the binary digits that follow the first 0 in each byte. So in 110x xxxx, the x's hold the first group of bits of the actual Unicode value. Every additional byte follows the pattern 10xx xxxx. Taking the values from bytes 1 and 2, we get:

110[10000] 10[101111] = 
      V        V
     10000 101111 = 42f = 1071
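
The same bit manipulation can be sketched in JavaScript. This is only an illustration of the 2-byte case worked through above; the function name decodeTwoByteUtf8 is made up for this example, and a real decoder would also have to handle 1-, 3-, and 4-byte sequences and reject overlong encodings.

```javascript
// Hypothetical helper: decode a single 2-byte UTF-8 sequence by hand.
function decodeTwoByteUtf8(byte1, byte2) {
  // Byte 1 must look like 110xxxxx and byte 2 like 10xxxxxx.
  const validLead = (byte1 & 0b11100000) === 0b11000000;
  const validContinuation = (byte2 & 0b11000000) === 0b10000000;
  if (!validLead || !validContinuation) {
    throw new RangeError("not a 2-byte UTF-8 sequence");
  }
  const high = byte1 & 0b00011111; // the 5 payload bits after "110"
  const low = byte2 & 0b00111111;  // the 6 payload bits after "10"
  return (high << 6) | low;        // glue them together: 11 bits in total
}

const codePoint = decodeTwoByteUtf8(0xd0, 0xaf);
console.log(codePoint);                       // 1071
console.log(codePoint.toString(16));          // "42f"
console.log(String.fromCodePoint(codePoint)); // "Я"
```

In practice you would let the platform do this, for example with TextDecoder, rather than masking bits yourself.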

The reason this is done is that common characters need fewer bytes for transmission and storage. But on the odd occasion that an uncommon character is needed, it can still be used as part of UTF-8.
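
To see the variable length for yourself, here is a small sketch that prints the UTF-8 bytes for a few characters. It assumes an environment where TextEncoder is available as a global (modern browsers and recent Node.js versions); the helper name utf8Hex and the sample characters are just choices for this example.

```javascript
// TextEncoder always produces UTF-8 bytes.
const encoder = new TextEncoder();

// Hypothetical helper: return the UTF-8 bytes of a string as spaced hex.
function utf8Hex(text) {
  return Array.from(encoder.encode(text), (byte) =>
    byte.toString(16).padStart(2, "0")
  ).join(" ");
}

for (const ch of ["A", "Я", "€", "😀"]) {
  const codePointHex = ch.codePointAt(0).toString(16);
  console.log(ch, "code point:", codePointHex, "UTF-8:", utf8Hex(ch));
}
// A  code point: 41    UTF-8: 41
// Я  code point: 42f   UTF-8: d0 af
// €  code point: 20ac  UTF-8: e2 82 ac
// 😀 code point: 1f600 UTF-8: f0 9f 98 80
```

Only for "A" do the code point and the UTF-8 byte coincide; every character above the ASCII range gets spread across 2 to 4 bytes.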

If you have any questions, please comment.