python · character-encoding · future-proof

Deducing console codepage when PEP 528 may or may not be implemented?


I've got some code that needs to print certain non-ASCII characters to the console, when the console might not be in UTF-8 mode.

On Linux and Mac, it's simply the responsibility of anyone rowdy enough to use a non-UTF-8 terminal to set LANG and LC_CTYPE* appropriately. And on Windows with CPython ≥3.6, CPython handles this itself via PEP 528.

*If you're reading this in the future (on CPython ≥3.15), it looks like that'll be PYTHONIOENCODING.

However, that still leaves unhandled the case of people using something like PyPy on Windows. There, Python starts up naïvely with sys.stdout.encoding == 'utf-8' regardless of the codepage the terminal is actually using (and in this case there's no "rowdy user", Homebrew maintainer, or Linux distribution administrator to blame for not setting good env vars).

I'm currently working around it by just setting sys.stdout.encoding to whatever chcp says whenever "PyPy on Windows" is detected, but this will fail when PyPy implements PEP 528:

import platform
import re
import subprocess
import sys

#import colorama

def fixit():
    # implementation note: MUST be run before the first read from stdin.
    # (stdout and stderr may already have been written to, albeit perhaps corruptedly.)
    if platform.system() == 'Windows':
        #colorama.just_fix_windows_console()
        if platform.python_implementation() == 'PyPy':
            if sys.pypy_version_info > (7, 3, 15):
                import warnings
                warnings.warn("Applying workaround for https://github.com/pypy/pypy/issues/2999")
            # NB: this assumes an English Windows locale; chcp.com localizes its message.
            chcp_output = subprocess.check_output(['chcp.com'], encoding='ascii')
            cur_codepage = int(re.match(r'Active code page: (\d+)', chcp_output).group(1))
            cur_encoding = WINDOWS_CODEPAGES[cur_codepage]
            for f in [sys.stdin, sys.stdout, sys.stderr]:
                if f.encoding != cur_encoding:
                    f.reconfigure(encoding=cur_encoding)


WINDOWS_CODEPAGES = {
  437: 'ibm437',
  850: 'ibm850',
  1252: 'windows-1252',
  28591: 'iso-8859-1',
  28592: 'iso-8859-2',
  28593: 'iso-8859-3',
  65000: 'utf-7',
  65001: 'utf-8'
}
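
As an aside, a hand-maintained table like this can be given a fallback: Python already ships codecs named `cp437`, `cp1252`, and so on for many Microsoft identifiers, so `codecs.lookup` can cover codepages the table omits. A hedged sketch (the `get_console_encoding` name and the special-case table are mine, not from any standard API):

```python
import codecs

# A few identifiers whose Python codec names don't follow the 'cp<N>' pattern.
_SPECIAL = {65000: 'utf-7', 65001: 'utf-8', 1200: 'utf-16-le', 1201: 'utf-16-be'}

def get_console_encoding(codepage):
    """Best-effort mapping from a Microsoft code page identifier to a Python encoding name."""
    if codepage in _SPECIAL:
        return _SPECIAL[codepage]
    # Raises LookupError for identifiers Python has no codec for.
    return codecs.lookup('cp%d' % codepage).name

print(get_console_encoding(437))    # → cp437
print(get_console_encoding(65001))  # → utf-8
```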

Now, it seems to me that calling sys.stdout.reconfigure(encoding=sys.stdout._TTY_CODEPAGE) whenever sys.stdout.encoding != sys.stdout._TTY_CODEPAGE would be a sane and correct thing to do.

But that leaves me with the question: just exactly how can I get sys.stdout._TTY_CODEPAGE on Windows, when PEP 528 might-or-might-not be implemented?


Solution

  • (As the OP stated, Mac and Linux *already* expose that information via environment variables, so I'm only providing the Windows solution here.)

    When Python is not overriding your standard input and output with synthetic I/O classes*, the appropriate calls are ctypes.windll.kernel32.GetConsoleCP() for input and ctypes.windll.kernel32.GetConsoleOutputCP() for output. These will return a Microsoft "code page identifier", which must be crudely and awkwardly mapped to a Python "encoding".

    However, neither CPython nor any alternative implementation ships this mapping by default. There are various community-contributed mappings floating around online, but it's not clear whether any of them are correct.
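
    For completeness, the raw calls look something like this (guarded so the module stays importable off Windows; a sketch, not a drop-in solution):

```python
import sys

def console_codepages():
    """Return (input_codepage, output_codepage) on Windows, or None elsewhere."""
    if sys.platform != 'win32':
        return None
    import ctypes
    kernel32 = ctypes.windll.kernel32
    # GetConsoleCP / GetConsoleOutputCP return 0 on failure (e.g. no attached console).
    in_cp = kernel32.GetConsoleCP()
    out_cp = kernel32.GetConsoleOutputCP()
    return (in_cp, out_cp) if in_cp and out_cp else None
```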

    All of these mappings will be incomplete, anyway, as Python's included "cp437" and "oem" encodings disagree with the actual behavior of Windows CMD on its default codepage 437 on all of {u'\u00a7', u'\u00b6', u'\u203c', u'\u2190', u'\u2191', u'\u2192', u'\u2193', u'\u2195', u'\u21a8', u'\u221f', u'\u2302', u'\u25ac', u'\u25b2', u'\u25ba', u'\u25bc', u'\u25c4', u'\u263a', u'\u263b', u'\u263c', u'\u2640', u'\u2642', u'\u2660', u'\u2663', u'\u2665', u'\u2666', u'\u266b'}. (See this answer for more details, and an "x-microsoft-cp437" codec that bridges the gap.)
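
    To make that disagreement concrete: Python's built-in cp437 codec follows the standard mapping, which treats bytes 0x00–0x1F as C0 control characters, whereas CMD renders glyphs for them, so characters like U+263A are simply unencodable with the stock codec:

```python
# Python's stock cp437 codec maps byte 0x01 to U+0001, not to U+263A ('☺'),
# so the smiley that CMD displays on codepage 437 cannot be encoded with it.
try:
    '\u263a'.encode('cp437')
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(encodable)                       # → False
print(repr(b'\x01'.decode('cp437')))   # → '\x01' (a control character, not '☺')
```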


    *The best known way to check if Python is overriding a console file with a synthetic class is:

    import io
    import platform
    import sys
    import warnings
    
    def is_synthetic_tty(f=sys.stdout):
        """Return True if *f* is a synthetic TTY (e.g. a PEP 528 console wrapper), False otherwise."""
    
        if not f.isatty():
            return False
    
        impl = platform.python_implementation()
    
        if impl == 'CPython':
            return not isinstance(f.buffer.raw, io.FileIO)
    
        elif impl == 'PyPy':
            if sys.pypy_version_info > (7, 3, 15):
                warnings.warn("https://github.com/pypy/pypy/issues/2999")
            return not isinstance(f.buffer.raw, io.FileIO)
    
        else:
            warnings.warn(f"Unrecognized implementation, {impl!r}")
            return not isinstance(f.buffer.raw, io.FileIO)
    

    When this occurs, the console is likely UTF-16LE hiding behind a hard-coded UTF-8 wrapper that respects neither .encoding nor .reconfigure(). That means you have no hope of identifying the codepage and should not interact with the encoding at all; just write Unicode strings to it.

    A core PyPy maintainer has ambiguously suggested both that he might and that he might not implement a solution compatible even with this check, so keeping that warning is absolutely critical. There is currently no future-proof way to detect this situation.