python, string, python-3.x, encoding, python-internals

How to find out internal string encoding?


From PEP 393 I understand that Python can use multiple encodings internally when storing strings: latin1, UCS-2, UCS-4. Is it possible to find out what encoding is used to store a particular string, e.g. in the interactive interpreter?


Solution

  • The only way to test this from the Python layer (short of mucking about with object internals via ctypes or a Python extension module) is to check the ordinal value of the largest character in the string, which determines whether the string is stored as one byte (ASCII/latin-1), two bytes (UCS-2) or four bytes (UCS-4) per character. A solution would be something like:

    def get_bpc(s):
        """Return the bytes per character (1, 2 or 4) used to store s."""
        # default='\0' handles the empty string, which is stored
        # as one byte per character
        maxordinal = ord(max(s, default='\0'))
        if maxordinal < 256:
            return 1
        elif maxordinal < 65536:
            return 2
        else:
            return 4
    

    You can't actually rely on sys.getsizeof, because for non-ASCII strings (even one-byte-per-character strings in the latin-1 range) the string may or may not have populated its cached UTF-8 representation. Tricks like appending an extra character and comparing sizes can therefore show the size *decrease*, and the cache can even be populated "at a distance", so code you never ran on the string you're checking may be responsible for its cached UTF-8 form. For example:

    >>> import sys
    >>> e = 'é'
    >>> sys.getsizeof(e)
    74
    >>> sys.getsizeof(e + 'a')
    75
    >>> class é: pass  # One of several ways to trigger creation/caching of UTF-8 form
    >>> sys.getsizeof(e)
    77  # !!! Grew three bytes even though it's the same variable
    >>> sys.getsizeof(e + 'a')
    75  # !!! Adding a character shrunk the string!
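
If you do still want a sys.getsizeof-based check, one workaround (a sketch, not part of the original answer) is to size two *freshly built* strings: a string that has only just been constructed cannot have a cached UTF-8 form yet, so the size difference between two fresh strings of the same kind reflects only the per-character storage width:

```python
import sys

def bpc_via_sizeof(s):
    # Sketch, assuming CPython's PEP 393 compact string layout:
    # two freshly built strings of the same kind differing by exactly
    # 100 characters have no cached UTF-8 form, so their size
    # difference is 100 * bytes-per-character.
    if not s:
        return 1  # the empty string is stored as 1 byte per character
    widest = max(s)  # the widest character determines the storage kind
    return (sys.getsizeof(widest * 200) - sys.getsizeof(widest * 100)) // 100
```

This should agree with get_bpc above, e.g. 1 for 'abc' and 'é', 2 for '\u1234', 4 for '\U0001f600'.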
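
For completeness, the "mucking about with object internals via ctypes" route mentioned above looks roughly like this. It is a fragile, CPython-only sketch: it assumes the PEP 393 PyASCIIObject header layout (four pointer-sized fields before the state bitfield, with the kind in bits 2–4), which is an implementation detail that can change between versions:

```python
import ctypes

def str_kind(s):
    # Fragile CPython-only sketch: read the PEP 393 `kind` bits straight
    # out of the PyASCIIObject header. Assumes the header starts with four
    # pointer-sized fields (ob_refcnt, ob_type, length, hash) followed by
    # the `state` bitfield, as in CPython 3.3+.
    assert isinstance(s, str)
    offset = 4 * ctypes.sizeof(ctypes.c_ssize_t)
    state = ctypes.c_uint32.from_address(id(s) + offset).value
    return (state >> 2) & 0x7  # 1 = latin-1, 2 = UCS-2, 4 = UCS-4
```

Unlike get_bpc, this inspects how the string is *actually* stored rather than inferring it from the character range, but it will break on any interpreter that doesn't use this exact struct layout.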