I just read PEP 393 and learned that Python's str type uses different internal representations depending on the content. So, I experimented a little bit and was a bit surprised by the results:
>>> import sys
>>> sys.getsizeof('')
41
>>> sys.getsizeof('H')
42
>>> sys.getsizeof('Hi')
43
>>> sys.getsizeof('Ö')
61
>>> sys.getsizeof('Öl')
59
I understand that in the first three cases, the strings don't contain any non-ASCII characters, so an encoding with 1 byte per char can be used. Putting a non-ASCII character like Ö in a string forces the interpreter to use a different encoding. Therefore, I'm not surprised that 'Ö' takes more space than 'H'.
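A quick way to check the per-character part of this is to compare two strings that differ only by one repetition of the same character. This is a minimal sketch, assuming CPython (where `sys.getsizeof` works on `str`); `per_char_cost` is a hypothetical helper name, not something from PEP 393. As far as I understand PEP 393, the width per character is 1, 2, or 4 bytes depending on the widest code point in the string, so any remaining size difference between 'H' and 'Ö' would have to come from the fixed part of the object rather than from the characters.

```python
import sys


def per_char_cost(ch: str) -> int:
    """Bytes that one more copy of `ch` adds to sys.getsizeof()."""
    # Both strings are built at run time, so neither is a shared constant.
    n = 10
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


# ASCII, Latin-1, a BMP character, and a non-BMP character.
for ch in ("a", "\xd6", "\u20ac", "\U0001f600"):
    print(f"U+{ord(ch):04X}: {per_char_cost(ch)} byte(s) per character")
```

Taking the difference between two lengths cancels out the fixed header, which is the part that varies between Python versions.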
However, why does 'Öl' take less space than 'Ö'? I assumed that whatever internal representation is used for 'Öl' allows for an even shorter representation of 'Ö'.
I'm using Python 3.12; apparently this is not reproducible in earlier versions.
This test code (the structures are only correct according to 3.12.4 source, and even so I didn't quite double-check them)
import ctypes
import sys


class PyUnicodeObject(ctypes.Structure):
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
        ("length", ctypes.c_ssize_t),
        ("hash", ctypes.c_ssize_t),
        ("state", ctypes.c_uint64),
    ]


class StateBitField(ctypes.LittleEndianStructure):
    _fields_ = [
        ("interned", ctypes.c_uint, 2),
        ("kind", ctypes.c_uint, 3),
        ("compact", ctypes.c_uint, 1),
        ("ascii", ctypes.c_uint, 1),
        ("statically_allocated", ctypes.c_uint, 1),
        ("_padding", ctypes.c_uint, 24),
    ]

    def __repr__(self):
        return ", ".join(f"{k}: {getattr(self, k)}" for k, *_ in self._fields_ if not k.startswith("_"))


def dump_s(s: str):
    # Reinterpret the str object's memory as the struct declared above.
    o = PyUnicodeObject.from_address(id(s))
    state_int = o.state
    state = StateBitField.from_buffer(ctypes.c_uint64(state_int))
    print(f"{s!r}".ljust(8), f"{o.length=}, {sys.getsizeof(s)=}, {state}")


dump_s('5')
dump_s('a')
dump_s('ä')
dump_s('vvv')
dump_s('ÖÖÖ')
dump_s(str(chr(214)))  # avoid the string having been interned into module source
dump_s(str(chr(214) + chr(108)))  # avoid the string having been interned into module source
prints out
'5' o.length=1, sys.getsizeof(s)=42, interned: 3, kind: 1, compact: 1, ascii: 1, statically_allocated: 1
'a' o.length=1, sys.getsizeof(s)=42, interned: 3, kind: 1, compact: 1, ascii: 1, statically_allocated: 1
'ä' o.length=1, sys.getsizeof(s)=61, interned: 0, kind: 1, compact: 1, ascii: 0, statically_allocated: 1
'vvv' o.length=3, sys.getsizeof(s)=44, interned: 2, kind: 1, compact: 1, ascii: 1, statically_allocated: 0
'ÖÖÖ' o.length=3, sys.getsizeof(s)=60, interned: 0, kind: 1, compact: 1, ascii: 0, statically_allocated: 0
'Ö' o.length=1, sys.getsizeof(s)=61, interned: 0, kind: 1, compact: 1, ascii: 0, statically_allocated: 1
'Öl' o.length=2, sys.getsizeof(s)=59, interned: 0, kind: 1, compact: 1, ascii: 0, statically_allocated: 0
'Ö' o.length=1, sys.getsizeof(s)=61, interned: 0, kind: 1, compact: 1, ascii: 0, statically_allocated: 1
The smoking gun seems to be the statically_allocated flag, which is set on Ö and the other single-character strings.
I think that stems from this line in pycore_runtime_init_generated, where it looks like the runtime statically allocates objects for all single-character Latin-1 strings (among others). As discussed in the comments, this CPython PR added UTF-8 representations of all of these statically allocated strings, so Ö is statically stored as both Latin-1 (1 byte) and UTF-8 (2 bytes). The reported size includes that extra UTF-8 copy, which is why 'Ö' comes out larger than 'Öl'.
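The numbers in the output above can be reproduced with a bit of arithmetic. The sketch below is a hypothetical model, not CPython's code: `modelled_size` is a made-up name, and the two header sizes are assumptions inferred from the measurements in this post for a 64-bit 3.12 build ('' reports 41, i.e. 40 + 1, and 'Öl' reports 59, i.e. 56 + 3). They will likely differ on other versions or platforms.

```python
# Assumed header sizes for 64-bit CPython 3.12, inferred from the
# measurements above; not taken from any header file here.
ASCII_HEADER = 40     # ASCII-only compact strings
COMPACT_HEADER = 56   # other compact strings (extra UTF-8 pointer and length)


def modelled_size(length: int, kind: int, is_ascii: bool, utf8_cached: int = 0) -> int:
    """Hypothetical model of the size reported for a compact str.

    `utf8_cached` is the byte length of a separately stored UTF-8 copy,
    or 0 if the string has none.
    """
    header = ASCII_HEADER if is_ascii else COMPACT_HEADER
    size = header + (length + 1) * kind  # character data plus terminator
    if utf8_cached:
        size += utf8_cached + 1  # UTF-8 copy plus its terminator
    return size


print(modelled_size(1, 1, True))       # 'a'  -> 42
print(modelled_size(2, 1, False))      # 'Öl' -> 59
print(modelled_size(1, 1, False, 2))   # 'Ö' with its 2-byte UTF-8 copy -> 61
```

Without the UTF-8 copy, the model would give 58 for 'Ö', one less than 'Öl', which is the ordering the question expected.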
Also, I should note that getsizeof() actually forwards to unicode_sizeof_impl, so it reports a computed size rather than measuring memory.
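The effect of a cached UTF-8 copy can also be provoked on an ordinary, non-static string. This is a rough sketch, assuming CPython with `ctypes.pythonapi` available; it relies on the C API function `PyUnicode_AsUTF8`, which is documented as caching the UTF-8 bytes on the string object. `sizes_around_utf8_cache` is a hypothetical helper name.

```python
import ctypes
import sys

# C API call that, per its documentation, caches the UTF-8 form on the str.
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
as_utf8.argtypes = [ctypes.py_object]
as_utf8.restype = ctypes.c_char_p


def sizes_around_utf8_cache(s):
    """Return sys.getsizeof(s) before and after requesting its UTF-8 form."""
    before = sys.getsizeof(s)
    as_utf8(s)  # result is discarded; only the caching side effect matters
    after = sys.getsizeof(s)
    return before, after


s = "".join([chr(214), "l"])  # 'Öl', built at run time
print(sys.getsizeof(s) == s.__sizeof__())  # same computed value for str
before, after = sizes_around_utf8_cache(s)
print(before, after, after - before, len(s.encode("utf-8")) + 1)
```

If the assumption holds, the same object reports a larger size after the call, and the increase equals the UTF-8 length plus one terminator byte, even though the string's contents never changed.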