utf-8 character-encoding cjk url-encoding

What is the encoding of Chinese characters on Wikipedia?


I was looking at the encoding of Chinese characters on Wikipedia and I'm having trouble figuring out what they are using. For instance, "的" is encoded as "%E7%9A%84" (see here). That's three bytes; however, none of the encodings described on this page use three bytes to represent Chinese characters. UTF-8, for instance, uses 2 bytes.

I'm basically trying to match these three bytes to an actual character. Any suggestions on what encoding it could be?


Solution

    >>> # Python 2 session: a plain '...' literal is a byte string here
    >>> c = '\xe7\x9a\x84'.decode('utf8')
    >>> c
    u'\u7684'
    >>> print c
    的
    


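    It is UTF-8. The character's code point is U+7684, which fits in 16 bits, but UTF-8 encodes every code point from U+0800 to U+FFFF as 3 bytes (only code points up to U+07FF fit in 2). The "%E7%9A%84" in the URL is simply those three UTF-8 bytes, percent-encoded.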
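
The session above is Python 2. As a rough Python 3 sketch of the same check, the snippet below decodes the percent-encoded bytes and also unpacks the 3-byte pattern (1110xxxx 10xxxxxx 10xxxxxx) by hand to show where the 16 bits of U+7684 end up. The helper name `utf8_bytes_to_code_point` is made up for illustration, and it does not validate overlong or surrogate sequences; `quote` and `unquote` come from the standard `urllib.parse` module.

```python
from urllib.parse import quote, unquote


def utf8_bytes_to_code_point(b0: int, b1: int, b2: int) -> int:
    """Hypothetical helper: decode one 3-byte UTF-8 sequence by hand."""
    # Lead byte must look like 1110xxxx, continuation bytes like 10xxxxxx.
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("not a 3-byte UTF-8 sequence")
    # Keep 4 + 6 + 6 payload bits and join them into one 16-bit value.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


encoded = "%E7%9A%84"

# Strip the percent signs to get the raw bytes, then decode them as UTF-8.
raw = bytes.fromhex(encoded.replace("%", ""))
text = raw.decode("utf-8")

print(text, f"U+{ord(text):04X}", len(raw))      # 的 U+7684 3
print(hex(utf8_bytes_to_code_point(*raw)))       # 0x7684

# urllib.parse does the percent-decoding and UTF-8 decoding in one step.
print(unquote(encoded) == text)                  # True
print(quote(text))                               # %E7%9A%84
```

In practice `unquote` alone is enough to turn such a URL fragment back into text; the manual bit-unpacking is only there to show why a 16-bit code point takes three bytes.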