Tags: python, encoding, decoding, hexdump

Why does the encode method return a non-hex code point in Python?


There are some unusual Chinese characters, such as '觱' and '踨'. When I check the gb18030 encoding of one of them, I get the following:

>>> u'觱'.encode('gb18030')
'\xd3v'

I am confused by the result '\xd3v'. It does not look like valid hex digits.
Can anyone explain it clearly?

My actual task is to convert gb18030 code points given as hex strings, such as 'CDF2' and 'F4A5',
into the corresponding Unicode characters.

>>> 'CDF2'.decode('hex').decode('gb18030')
u'\u4e07'

But:

>>> 'd3v'.decode('hex').decode('gb18030')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/hex_codec.py", line 42, in hex_decode
    output = binascii.a2b_hex(input)
TypeError: Odd-length string

So I don't understand why the encode method returns a non-hex code point.
For example, what does the 'v' in '\xd3v' mean?


Solution

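  • '\xd3v' is the same two-byte string as '\xd3\x76'. When Python displays a byte string, it shows each printable ASCII byte as the character itself and uses a \x escape only for bytes that have no printable form (a few control characters get short escapes such as \n and \t). Byte 0x76 is the ASCII letter 'v', so it is displayed as v. The hex string to decode is therefore 'd376', not 'd3v'.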

    >>> '\xd3v' == '\xd3\x76'
    True
    

    If you want the hexadecimal form, use encode('hex') (the counterpart of the decode('hex') you already used):

    >>> u'觱'.encode('gb18030').encode('hex')
    'd376'
    

    or use binascii.hexlify:

    >>> import binascii
    >>> binascii.hexlify(u'觱'.encode('gb18030'))
    'd376'
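
The examples above are Python 2, where str has a 'hex' codec. As a rough sketch of the same round trip on Python 3 (an assumption, since the question only shows Python 2.7), bytes.fromhex and bytes.hex can stand in for decode('hex') and encode('hex'). The helper names gb18030_hex_to_text, text_to_gb18030_hex and byte_values are made up for illustration and are not part of any library.

```python
# Python 3 sketch; all helper names here are hypothetical.

def gb18030_hex_to_text(hex_string):
    """Interpret hex digits such as 'CDF2' or 'd376' as GB18030 bytes and decode them."""
    raw = bytes.fromhex(hex_string)  # e.g. 'd376' becomes the two bytes 0xd3 0x76
    return raw.decode('gb18030')


def text_to_gb18030_hex(text):
    """Encode text as GB18030 and return the bytes as lowercase hex digits."""
    return text.encode('gb18030').hex()


def byte_values(raw):
    """List every byte as two hex digits, including bytes that repr shows as letters."""
    return [format(b, '02x') for b in raw]


print(byte_values(b'\xd3v'))                    # ['d3', '76']
print(text_to_gb18030_hex('觱'))                # d376
print(gb18030_hex_to_text('CDF2') == '\u4e07')  # True
```

The byte_values call makes the point of the answer visible: the 'v' in the repr is just byte 0x76, so the full hex form of the encoded character is d376.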