There are some special Chinese characters, such as '觱' and '踨'. When I check their GB18030 encoding, I get the following:
>>> u'觱'.encode('gb18030')
'\xd3v'
I am confused by the result '\xd3v': it does not look like valid hex digits.
Can someone explain it clearly?
Actually, my task is to convert GB18030 code points, such as 'CDF2', 'F4A5', etc., into
the corresponding Unicode characters.
>>> 'CDF2'.decode('hex').decode('gb18030')
u'\u4e07'
But,
>>> 'd3v'.decode('hex').decode('gb18030')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/hex_codec.py", line 42, in hex_decode
    output = binascii.a2b_hex(input)
TypeError: Odd-length string
So I don't understand why the encode method returns a code point that is not hex.
For example, what does the 'v' in '\xd3v' mean?
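'\xd3v' is the same two-byte string as '\xd3\x76': the 'v' is simply the byte 0x76. When Python displays a byte string, it shows printable ASCII bytes as the characters themselves (and bytes such as \n and \t as short escapes) instead of in \x hexadecimal form.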
>>> '\xd3v' == '\xd3\x76'
True
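To make the point above concrete, here is a small sketch that lists the individual byte values of that string. The variable names are only illustrative, and bytearray is used so the same lines should behave the same on Python 2.7 and Python 3.

```python
# The two spellings denote the same two bytes: 0xd3 followed by 0x76 ('v').
encoded = b'\xd3v'

# Iterating over a bytearray yields integers on both Python 2.7 and Python 3,
# so each byte can be formatted as two hex digits.
byte_values = ['%02x' % b for b in bytearray(encoded)]

print(byte_values)                # ['d3', '76']
print(encoded == b'\xd3\x76')     # True
```

So the string has length 2, and 'v' is just how the byte 0x76 is displayed.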
If you want the hexadecimal form, use encode('hex') (as you did with decode):
>>> u'觱'.encode('gb18030').encode('hex')
'd376'
Or use binascii.hexlify:
>>> import binascii
>>> binascii.hexlify(u'觱'.encode('gb18030'))
'd376'
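For the conversion task described in the question, one possible approach is to go through binascii in both directions. This is only a sketch: the helper names gb18030_hex_to_unicode and unicode_to_gb18030_hex are made up for illustration, and binascii is used instead of the 'hex' codec on the assumption that the code should also run on Python 3, where str.decode('hex') is not available.

```python
import binascii


def gb18030_hex_to_unicode(hex_string):
    """Turn a hex string of GB18030 bytes (e.g. 'CDF2') into a Unicode string."""
    raw = binascii.unhexlify(hex_string)   # 'CDF2' -> the two bytes 0xcd 0xf2
    return raw.decode('gb18030')


def unicode_to_gb18030_hex(text):
    """Turn a Unicode string into the hex digits of its GB18030 bytes."""
    raw = text.encode('gb18030')
    # hexlify returns bytes on Python 3, so decode to get a text string.
    return binascii.hexlify(raw).decode('ascii')


print(gb18030_hex_to_unicode('CDF2') == u'\u4e07')   # True
print(unicode_to_gb18030_hex(u'\u4e07'))             # cdf2
```

Note that the input must contain only hex digits, two per byte, so the value from the question has to be written as 'd376' rather than 'd3v'.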