Search code examples
pythonpython-3.xcpythonpython-internals

CPython: Why does += for strings change the id of string variable


Cpython optimizes string increment operations,When initializing memory for a string, the program leaves extra expansion space for it,so, when incrementing, the original string is not copied to the new location. my question is why the id of string variable changes.

>>> s = 'ab'
>>> id(s)
991736112104
>>> s += 'cd'
>>> id(s)
991736774080

why the id of string variable changes.


Solution

  • The optimization you are trying to trigger is an implementation detail of CPython and is a quite subtle thing: there are many details (e.f. one you are experiencing) which can be preventing it.

    For a detailed explanation, one needs to dive into the CPython's implementation, so first I will try to give a hand-waving explanation, which should give at least the gist of what is going on. The gory details will be in the second part which highlights the important code-parts.


    Let's take a look at this function, which exhibits the desired/optimized behavior

    def add_str(str1, str2, n):
        for i in range(n):
            str1+=str2
            print(id(str1))
        return str1
    

    Calling it, leads to the following output:

    >>> add_str("1","2",100)
    2660336425032
    ... 4 times
    2660336425032
    2660336418608
    ... 6 times
    2660336418608
    2660336361520
    ... 6 times
    2660336361520
    2660336281800
     and so on
    

    I.e. a new string is created only every 8 addition, otherwise the old string (or as we will see the memory) is reused. The first id is printed only 6 times because it starts printing when the size of the unicode-object is 2 modulo 8 (and not 0 as in the later cases).

    The first question is, if a string is immutable in CPython, how (or better when) can it be changed? Obviously, we can't change the string if it is bound to different variables - but we could change it, if the current variable is the only one reference - which can be checked pretty easily due to reference counting of CPython (and it is the reason why this optimization isn't available for other implementation which don't use reference counting).

    Let's change the function above by adding a additional reference:

    def add_str2(str1, str2, n):
        for i in range(n):
            ref = str1
            str1+=str2
            print(id(str1))
        return str1
    

    Calling it leads to:

    >>> add_str2("1","2",20)
    2660336437656
    2660337149168
    2660337149296
    2660337149168
    2660337149296
    ... every time a different string - there is copying!
    

    This actually explains your observation:

    import sys
    s = 'ab'
    print(sys.getrefcount(s))
    # 9
    print(id(s))
    # 2660273077752
    s+='a'
    print(id(s))
    # 2660337158664  Different
    

    Your string s is interned (see for example this SO-answer for more information about string interning and integer pool), and thus s is not only one "using" this string and thus this string cannot be changed.

    If we avoid the interning, we can see, that the string is reused:

    import sys
    s = 'ab'*21  # will not be interned
    print(sys.getrefcount(s))
    # 2, that means really not interned
    print(id(s))
    # 2660336107312
    s+='a'
    print(id(s))
    # 2660336107312  the same id!
    

    But how does this optimization works?

    CPython uses its own memory management - the pymalloc allocator, which is optimized for small objects with short lifetimes. The used memory-blocks are multiple of 8 bytes, that means if allocator is asked for only 1 byte, still 8 bytes are marked as used (more precise because of the 8-byte aligment of the returned pointers the the remaining 7 bytes cannot be used for other objects).

    There is however the function PyMem_Realloc: if the allocator is asked to reallocate a 1byte-block as a 2byte-block, there is nothing to do - there were some reserved bytes anyway.

    This way, if there is only one reference to the string, CPython can ask the allocator to reallocate the string and require a byte more. In 7 cases of 8 there is nothing to do for allocator and the additional byte becomes available almost free.

    However, if the size of the string changes by more than 7 bytes, the copying becomes mandatory:

    >>> add_str("1", "1"*8, 20)  # size change of 8
    2660337148912
    2660336695312
    2660336517728
    ... every time another id
    

    Furthermore, pymalloc falls back to PyMem_RawMalloc, which is usually the memory manager of the C-runtime, and the above optimization for strings is no longer possible:

    >>> add_str("1"*512, "1", 20) #  str1 is larger as 512 bytes
    2660318800256
    2660318791040
    2660318788736
    2660318807744
    2660318800256
    2660318796224
    ... every time another id
    

    Actually, whether the addresses are different after each reallocation depends on the memory allocator of the C-runtime and its state. If memory isn't defragmented, the chances are high, that realloc manages to extend memory without copying (but it was not the case on my machine as I did these experiments), see also this SO-post.


    For the curious, here is the whole traceback of the str1+=str2 operation, which can be easily followed in a debugger:

    That is what going on:

    The += is compiled to BINARY_ADD-optcode and when evaluated in ceval.c, there is a hook/special handling for unicode objects (see PyUnicode_CheckExact):

    case TARGET(BINARY_ADD): {
        PyObject *right = POP();
        PyObject *left = TOP();
        PyObject *sum;
        ...
        if (PyUnicode_CheckExact(left) &&
                 PyUnicode_CheckExact(right)) {
            sum = unicode_concatenate(left, right, f, next_instr);
            /* unicode_concatenate consumed the ref to left */
        }
        ...
    

    unicode_concatenate ends up calling PyUnicode_Append, which checks whether the left-operand is modifiable (which basically checks that there is only one reference, string isn't interned and some further stuff) and resizes it or creates a new unicode-object otherwise:

    if (unicode_modifiable(left)
        && ...)
    {
        /* append inplace */
        if (unicode_resize(p_left, new_len) != 0)
            goto error;
    
        /* copy 'right' into the newly allocated area of 'left' */
        _PyUnicode_FastCopyCharacters(*p_left, left_len, right, 0, right_len);
    }
    else {
        ...
        /* Concat the two Unicode strings */
        res = PyUnicode_New(new_len, maxchar);
        if (res == NULL)
            goto error;
        _PyUnicode_FastCopyCharacters(res, 0, left, 0, left_len);
        _PyUnicode_FastCopyCharacters(res, left_len, right, 0, right_len);
        Py_DECREF(left);
        ...
    }
    

    unicode_resize ends up calling resize_compact (mostly because in our case we have only ascii-characters), which ends up calling PyObject_REALLOC:

    ...
    new_unicode = (PyObject *)PyObject_REALLOC(unicode, new_size);
    ...
    

    which basically will be calling pymalloc_realloc:

    static int
    pymalloc_realloc(void *ctx, void **newptr_p, void *p, size_t nbytes)
    {
        ...
        /* pymalloc is in charge of this block */
        size = INDEX2SIZE(pool->szidx);
        if (nbytes <= size) {
            /* The block is staying the same or shrinking.
              ....
                *newptr_p = p;
                return 1; // 1 means success!
              ...
        }
        ...
    }
    

    Where INDEX2SIZE just rounds up to the nearest multiple of 8:

    #define ALIGNMENT               8               /* must be 2^N */
    #define ALIGNMENT_SHIFT         3
    
    /* Return the number of bytes in size class I, as a uint. */
    #define INDEX2SIZE(I) (((uint)(I) + 1) << ALIGNMENT_SHIFT)
    

    qed.