Python: Unicode in the Windows terminal, which encoding is used?


I am using the Python interpreter in the Windows 7 terminal.
I am trying to wrap my head around Unicode and encodings.

I type:

>>> s='ë'
>>> s
'\x89'
>>> u=u'ë'
>>> u
u'\xeb'

Question 1: Why is the encoding used in the string s different from the one used in the unicode string u?

I continue, and type:

>>> us=unicode(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x89 in position 0: ordinal
not in range(128)
>>> us=unicode(s, 'latin-1')
>>> us
u'\x89'

Question 2: I guessed at the latin-1 encoding to turn the string into a unicode string (actually, I tried a bunch of other ones first, including utf-8). How can I find out which encoding the terminal used to encode my string?

Question 3: How can I make the terminal print ë as ë instead of '\x89' or u'\xeb'? Hmm, stupid me. print(s) does the job.

I already looked at this related SO question, but no clues from there: Set Python terminal encoding on Windows


Solution

  • Unicode is not an encoding. You encode Unicode strings into byte strings and decode byte strings into Unicode strings:

    >>> '\x89'.decode('cp437')
    u'\xeb'
    >>> u'\xeb'.encode('cp437')
    '\x89'
    >>> u'\xeb'.encode('utf8')
    '\xc3\xab'
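
The same round trip accounts for the outputs in the question: s holds the byte the cp437 console produced for ë (0x89), while u holds the Unicode code point U+00EB. Decoding with latin-1 does not fail because every byte is valid latin-1, but it maps 0x89 to a different character. A minimal sketch, assuming the console code page is cp437 as in this answer (the variable names are illustrative only):

```python
# Assumption: the console encoded the typed character with cp437.
raw = b'\x89'  # the byte stored in s after typing the character

right = raw.decode('cp437')    # U+00EB, LATIN SMALL LETTER E WITH DIAERESIS
wrong = raw.decode('latin-1')  # U+0089, a C1 control character, not the typed letter

print('cp437   -> U+%04X' % ord(right))
print('latin-1 -> U+%04X' % ord(wrong))

# Encoding the correct code point gives different bytes per encoding.
print(repr(right.encode('cp437')))
print(repr(right.encode('latin-1')))
print(repr(right.encode('utf-8')))
```

The latin-1 guess only appears to work: it yields a unicode string, just not the one that was typed.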
    

    The Windows terminal uses legacy DOS (OEM) code pages. On US Windows the default is cp437:

    >>> import sys
    >>> sys.stdout.encoding
    'cp437'
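
For Question 2, the encodings Python reports can be collected in one place rather than guessed. This is only a sketch: describe_encodings is a hypothetical helper name, and the values it returns depend on the Python version, the system locale, and whether output is redirected (a stream's encoding can be None in that case).

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python reports for the current session."""
    return {
        # getattr guards against streams that are missing or lack the attribute
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
        # the locale's preferred encoding (the Windows code page on Windows)
        'preferred': locale.getpreferredencoding(),
    }


info = describe_encodings()
for name in sorted(info):
    print('%-9s %s' % (name, info[name]))
```

Once the console's encoding is known, it can be passed to unicode(s, encoding) in place of a guess.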
    

    Windows GUI applications use Windows (ANSI) code pages. Python's IDLE reports the Windows encoding:

    >>> import sys
    >>> sys.stdout.encoding
    'cp1252'
    

    Your results may vary with the system locale.
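
To see why the two environments disagree, the same character can be encoded under both code pages. A rough sketch, assuming cp437 for the console and cp1252 for IDLE as shown above; encode_for is a hypothetical helper, and the 'replace' error handler is chosen here only so that unrepresentable characters do not raise:

```python
def encode_for(text, encoding, errors='replace'):
    """Encode text for a target code page, substituting '?' where needed."""
    return text.encode(encoding, errors)


e_diaeresis = u'\xeb'  # the character from the question
euro = u'\u20ac'       # a character that cp437 cannot represent

console_bytes = encode_for(e_diaeresis, 'cp437')   # byte 0x89
idle_bytes = encode_for(e_diaeresis, 'cp1252')     # byte 0xEB

print(repr(console_bytes))
print(repr(idle_bytes))
print(repr(encode_for(euro, 'cp437')))   # replaced, not in this code page
print(repr(encode_for(euro, 'cp1252')))  # byte 0x80
```

The same text therefore has different byte values depending on which code page the output stream uses, which is why a byte string captured in one environment may display incorrectly in another.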