Search code examples
pythonpython-unicode

why do python need unicode type since I can declare a variable directly with any unicode character?


I read carefully about the unicode pain article days ago and asked this question hours ago:

Do I have to encode unicode variable before write to file?

But lately a strange question came into my mind.

I found out that these codes work fine:

chinese = ['中文', '你好']  # py2, these are bytes, type is str
with open('filename', 'wb') as f:
    f.writelines(chinese)

Since I can declare a variable directly with any unicode characters both in py2 and py3, what do python(or we) get unicode type involved? Can't we just use str(py2) and bytes(py3) type through our entire program? Then the so-called unicode pain won't exist.

Can anybody give me some insights?


Solution

  • Since I can declare a variable directly with any unicode characters [...]

    But that's not what you've done. They may look like characters, but they are encoded as bytes in the source file. If you try to do anything actually useful with the values, e.g. slice, subscript, take the length of, then everything breaks down. That is the "Unicode pain".

    >>> '中文'[1]
    '\xb8'