
Decoding list of hex in python3


I have a list of hex that I would like to transform into a list of unicode characters. Everything here is done with python-3.5.

If I do print(bytes.fromhex('hex_number').decode('utf-8')) it works, but it does not work if, after the conversion, I store the chars in a list again:

a = ['0063'] # The hex equivalent of the 'c' char.
b = [bytes.fromhex(_).decode('utf-8') for _ in a]
print(b)

will print

['\x00c']

instead of

['c']

while the code

a = ['0063']
for _ in a:
    print(bytes.fromhex(_).decode('utf-8'))

prints, as expected:

c

Can someone explain to me how I can convert the list ['0063'] into the list ['c'], and why I get this (to me) strange behavior?

To see what the hex 0063 corresponds to, look here.


Solution

  • You don't have UTF-8 data if 0063 is U+0063 LATIN SMALL LETTER C. At best you have UTF-16 data, in big-endian order:

    >>> bytes.fromhex('0063').decode('utf-16-be')
    'c'
    
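    Applied to the list comprehension from the question, the same codec change produces the desired output. A minimal sketch, assuming every entry in the list is a single two-byte big-endian code unit:

    ```python
    # Decode each 4-hex-digit string as one big-endian UTF-16 code unit.
    a = ['0063']
    b = [bytes.fromhex(h).decode('utf-16-be') for h in a]
    print(b)  # ['c']
    ```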

    You may want to check whether your full data starts with a Byte Order Mark; for big-endian UTF-16 that'd be 'FEFF' in hex. In that case you can drop the -be suffix, as the decoder will read the BOM to determine the byte order. If your data starts with 'FFFE' instead, you have little-endian UTF-16 and you sliced your data at the wrong point: you took along the '00' byte of the preceding codepoint.
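    A small sketch of that BOM behaviour, using a hypothetical input prefixed with 'FEFF': the plain utf-16 codec consumes the BOM and picks the byte order by itself, so no -be suffix is needed:

    ```python
    # 'feff' is the big-endian BOM; the generic utf-16 codec reads it,
    # strips it, and decodes the rest in the order the BOM indicates.
    data = bytes.fromhex('feff0063')
    print(data.decode('utf-16'))  # c
    ```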

    UTF-8 is a variable-width encoding. The first 128 codepoints in the Unicode standard (corresponding to the ASCII range) encode to single bytes that map directly to ASCII. Codepoints in the Latin-1 range and beyond (up to U+07FF(*), the next 1920 codepoints) map to two bytes, etc.
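    The variable width is easy to observe by encoding a sample character from each range (the specific characters below are just illustrative):

    ```python
    # UTF-8 byte lengths grow with the codepoint value.
    print(len('c'.encode('utf-8')))   # 1 byte: U+0063, ASCII range
    print(len('é'.encode('utf-8')))   # 2 bytes: U+00E9, Latin-1 range
    print(len('€'.encode('utf-8')))   # 3 bytes: U+20AC, beyond U+07FF
    ```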

    If your input really was UTF-8, then you really have a \x00 NULL character before that 'c'. Printing a NULL results in no output on many terminals, but you can use cat -v to turn such non-printable characters into caret escape codes:

    $ python3 -c "print('\x00c')"
    c
    $ python3 -c "print('\x00c')" | cat -v
    ^@c
    

    ^@ is the representation for a NULL in the caret notation used by cat.
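    You can also stay inside Python instead of piping through cat -v: repr() makes the NULL visible and confirms the decoded string really contains two characters, not one:

    ```python
    # Decoding '0063' as UTF-8 keeps the leading NULL byte as U+0000.
    s = bytes.fromhex('0063').decode('utf-8')
    print(repr(s))  # '\x00c'
    print(len(s))   # 2
    ```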


    (*) U+07FF is not currently mapped in Unicode; the last UTF-8 two-byte codepoint currently possible is U+07FA NKO LAJANYALAN.