
CPython strings of 21 chars or more - memory allocation


I'm wondering what could be the reason for this behaviour (CPython 2.7 and 3.5):

>>> a = 's' ; b = 's'
>>> id(a), id(b)
(4322870976, 4322870976)

Strings shorter than 21 chars seem to share the same memory address (or id).

>>> a = 's' * 20 ; b = 's' * 20
>>> id(a), id(b)
(4324218680, 4324218680)

From 21 on, this behaviour changes.

>>> a = 's' * 21 ; b = 's' * 21
>>> id(a), id(b)
(4324218536, 4324218608)

I wasn't able to find a reasonable explanation, but according to the Python docs:

E.g., after a = 1; b = 1, a and b may or may not refer to the same object with the value one, depending on the implementation...

After looking over CPython's code, I couldn't find where this decision is made.


Solution

  • The Python compiler evaluates constant expressions at compile time where it can and where it makes sense, and stores the result as a constant in the bytecode (constant folding). Equal string constants compiled together end up as the same object, and so have the same id(). This gives the results in the first and second examples.

    But we have to qualify "makes sense". Folding an expression with a large result would store that whole result in the bytecode, so the compiler leaves such expressions unmodified and their value is calculated at runtime. In CPython 2.7 and 3.5, the limit for a folded result that has a length (strings, tuples and so on) is 20, and so the expressions in the third example are evaluated by the VM rather than the compiler, producing two separate objects.
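One way to check this on your own interpreter is to compile the statement and look at the constants stored in the resulting code object. The sketch below is only illustrative: the helper names is_folded and same_object are made up for this example, and the exact size limit is an implementation detail that differs between CPython versions (it is 20 in the versions discussed here, and to my knowledge larger in recent releases).

```python
def is_folded(source, expected):
    """Return True if `expected` is stored as a constant in the compiled code.

    `is_folded` is a hypothetical helper for this example, not a CPython API.
    """
    code = compile(source, "<example>", "exec")
    # co_consts holds every constant the compiler stored for this code object.
    return any(type(c) is type(expected) and c == expected
               for c in code.co_consts)


def same_object(source):
    """Run `source` and report whether the names a and b are one object."""
    namespace = {}
    exec(compile(source, "<example>", "exec"), namespace)
    return namespace["a"] is namespace["b"]


# Short result: expected to be folded into a constant at compile time.
print(is_folded("a = 's' * 20", "s" * 20))

# Very long result: expected to be left for the VM to compute at runtime.
print(is_folded("a = 's' * 100000", "s" * 100000))

# Identity of the two results; this depends on the interpreter version.
print(same_object("a = 's' * 20; b = 's' * 20"))
print(same_object("a = 's' * 100000; b = 's' * 100000"))
```

In either case, a == b is the right way to compare string values; whether a is b holds is an implementation detail and should not be relied on.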