
How to change default character encoding for Python IDLE?


I'm using Python 3.6 on Windows. When I run a script using the standard Windows shell (cmd.exe), the default text encoding for stdin/stdout is 'utf-8' as expected in Python 3.x:

python -c "import sys; print(sys.stdout.encoding)"
utf-8

However, the same command in the IDLE shell gives a different result, which is annoying, especially for beginner students using IDLE as their first IDE:

>>> import sys; print(sys.stdout.encoding)
cp1252

It turns out that IDLE defines PseudoOutputFile and PseudoInputFile classes to wrap stdout/stdin. These classes include a hidden _encoding attribute which can be used to switch the encoding as needed:

>>> sys.stdout._encoding = 'utf-8'
>>> print(sys.stdout.encoding)
utf-8

But this setting is lost each time you run a script, because IDLE restarts its shell when running a module. Is there any long-term solution to change IDLE's default encoding for stdin/stdout?


Solution

  • For 2.7 and 3.5, the command line you show responds, for me, with cp437, the IBM PC or DOS encoding. Output to the Windows console is limited to a subset of Basic Multilingual Plane (BMP) Unicode characters.

    For 3.6, Python's handling of the Windows console was drastically improved to use utf-8 and potentially print any Unicode character, depending on font availability.

    For all current versions, IDLE also reports, for me, cp1252 (Latin 1). IDLE does attempt to get the system encoding, so I don't know why the two values differ. But it hardly matters, as the reported value is a dummy. To me, it is deceptive: non-latin1 characters cannot be encoded with latin1, whereas all BMP characters can be printed in IDLE. So I have thought about a replacement.

    When (unicode) strings are written to sys.stdout (usually with print), the string object is pickled to bytes in the user process, sent through a socket (implementation detail subject to change) to the IDLE process, and unpickled back to a string object. The effect is as if the string was encoded and decoded with one of the non-lossy utf codings. UTF-32 might be the closest to what pickling does.
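The round trip described above can be imitated outside IDLE. This is only a minimal sketch of the idea, assuming plain pickle.dumps and pickle.loads stand in for the transport; the socket is left out, and IDLE's real plumbing is more involved. The variable names are made up for illustration.

```python
import pickle

# A string mixing a Latin-1 character, other BMP characters,
# and one non-BMP (astral) character.
text = "caf\u00e9 \u03a9 \u4e2d \U0001F600"

# Pickling turns the str into bytes; unpickling restores it exactly,
# whatever characters it contains.
payload = pickle.dumps(text)
restored = pickle.loads(payload)
round_trip_ok = restored == text

# Encoding the same string with cp1252 fails, because most of these
# characters have no cp1252 representation.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

print(round_trip_ok, cp1252_ok)
```

The lossless round trip is why the reported cp1252 value says nothing about what can actually travel from the user process to the shell.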

    The IDLE process calls tkinter text.insert(index, string), which asks tk to insert the string in the widget. But that only works for BMP characters. The net effect is as if the output encoding were ucs-2, though I believe tk uses a truncated utf-8 internally.
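If the BMP limit matters for your output, one possible workaround is to substitute non-BMP characters before printing. The helpers below (is_bmp, replace_non_bmp) are hypothetical names for illustration, not part of IDLE or tkinter, and the sketch assumes the limitation behaves as described above for the versions discussed here.

```python
def is_bmp(ch):
    """Return True if the single character ch is in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF


def replace_non_bmp(text, replacement="\ufffd"):
    """Return text with every non-BMP character replaced (hypothetical helper)."""
    return "".join(ch if is_bmp(ch) else replacement for ch in text)


# Two BMP characters followed by one astral character (an emoji).
sample = "\u03a9 \u4e2d \U0001F600"
cleaned = replace_non_bmp(sample)
print(ascii(cleaned))
```

U+FFFD, the Unicode replacement character, is used here as the default substitute; any BMP character would do.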

    Similarly, any BMP character you enter in the shell or editor can be sent to the user process stdin after being displayed.

    Anyway, changing pseudofile.encoding has no effect, which is why it was made read-only by this part of the patch for issue 9290:

    -        self.encoding = encoding
    +        self._encoding = encoding
    +
    +    @property
    +    def encoding(self):
    +        return self._encoding
    

    The initial underscore means that _encoding is a private (not hidden) implementation detail that should be ignored by users.
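The effect of that patch can be reproduced with a small stand-in class. PseudoFileSketch is a hypothetical name used only for illustration; it copies the property pattern from the diff above and none of IDLE's actual I/O behaviour.

```python
class PseudoFileSketch:
    """Hypothetical stand-in that mimics only the read-only encoding property."""

    def __init__(self, encoding):
        self._encoding = encoding

    @property
    def encoding(self):
        # No setter is defined, so the public attribute is read-only.
        return self._encoding


out = PseudoFileSketch("cp1252")

# Assigning to the public attribute is rejected.
try:
    out.encoding = "utf-8"
    assignment_blocked = False
except AttributeError:
    assignment_blocked = True

# Assigning to the private attribute works, but it only changes the
# label this object reports.
out._encoding = "utf-8"
print(assignment_blocked, out.encoding)
```

In this sketch, as in the answer's description of IDLE, the private attribute is just a stored label, so overwriting it changes what is reported and nothing else.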