
How to create the str "1" at two different memory locations?


We can defeat the small-integer cache like this (a runtime calculation lets us bypass the caching layer):

>>> n = 674039
>>> one1 = 1
>>> one2 = (n ** 9 + 1) % (n ** 9)
>>> one1 == one2
True
>>> one1 is one2
False

How can you defeat small-string interning, i.e. obtain the following result:

>>> one1 = "1"
>>> one2 = <???>
>>> type(one2) is str and one1 == one2
True
>>> one1 is one2
False

sys.intern mentions that "Interned strings are not immortal", but there's no context about how a string could be kicked out of the intern dictionary, or how you can create a str instance that avoids the caching layer.
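For context, sys.intern maps equal strings to a single canonical object; a minimal CPython sketch (the runtime-built strings below are chosen purely for illustration):

```python
import sys

# Build two equal strings at runtime so they start out as distinct objects
# (string literals in one module may already be interned at compile time).
a = "".join(["he", "llo"])
b = "hel" + str("lo")
print(a is b)                          # False in CPython: distinct objects
print(sys.intern(a) is sys.intern(b))  # True: intern returns the canonical object
```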

Since interning is a CPython implementation detail, answers relying on undocumented implementation details are OK/expected.


Solution

  • Unicode strings consisting of only one character (with a value smaller than 128 or, more precisely, from latin1) are the most complicated case, because those strings aren't really interned. Much like the small-integer pool (and identically to the behavior for bytes), they are created at interpreter start-up and stored in an array for as long as the interpreter is alive:

    struct _Py_unicode_state {
        ...
        /* Single character Unicode strings in the Latin-1 range are being
           shared as well. */
        PyObject *latin1[256];
        ...
        /* This dictionary holds all interned unicode strings...
        */
        PyObject *interned;
        ...
    };
    

    So every time a length-1 unicode string is created, the character value is checked against the latin1 array. E.g. in unicode_decode_utf8:

    /* ASCII is equivalent to the first 128 ordinals in Unicode. */
    if (size == 1 && (unsigned char)s[0] < 128) {
        if (consumed) {
            *consumed = 1;
        }
        return get_latin1_char((unsigned char)s[0]);
    }
    

    One could even argue that if there were a way to circumvent this in the interpreter, we would be speaking about a (performance) bug.
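    The effect of this array can be checked from pure Python; a small CPython-specific sketch (none of this is guaranteed by the language):

```python
# Two length-1 ASCII strings produced by different runtime operations:
a = chr(49)    # "1", built from a code point
b = "123"[0]   # "1", extracted by indexing
# CPython hands back the same singleton from the latin1 array both times:
print(a == b, a is b)  # True True
```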

    One possibility is to populate the unicode data ourselves via the C-API. I use Cython for the proof of concept, but ctypes could be used to the same effect:

    %%cython
    cdef extern from *:
        """
        PyObject* create_new_unicode(char *ch)
        {
           PyUnicodeObject *ob = (PyUnicodeObject *)PyUnicode_New(1, 127);
           if (ob == NULL) {
               return NULL; /* allocation failed */
           }
           Py_UCS1 *data = PyUnicode_1BYTE_DATA(ob);
           data[0] = ch[0]; /* fill data without going through unicode_decode_utf8 */
           return (PyObject *)ob;
        }
        """
        object create_new_unicode(char *ch)
        
    def gen1():
        return create_new_unicode(b"1")
    

    Noteworthy details:

    • PyUnicode_New does not look into the latin1 array, because the characters aren't set yet.
    • For simplicity, the above works only for ASCII characters - thus we pass 127 as maxchar to PyUnicode_New. As a result, we can access the data via PyUnicode_1BYTE_DATA, which makes it easy to manipulate directly without much ado.

    And now:

    a, b = gen1(), gen1()
    a is b, a == b
    # yields (False, True)
    

    as wanted.


    Here is a similar idea, but implemented with ctypes:

    from ctypes import POINTER, py_object, c_ssize_t, byref, pythonapi
    PyUnicode_New = pythonapi.PyUnicode_New
    PyUnicode_New.argtypes = (c_ssize_t, c_ssize_t)
    PyUnicode_New.restype = py_object
    PyUnicode_CopyCharacters = pythonapi._PyUnicode_FastCopyCharacters
    PyUnicode_CopyCharacters.argtypes = (py_object, c_ssize_t, py_object, c_ssize_t, c_ssize_t)
    PyUnicode_CopyCharacters.restype = c_ssize_t
    
    def clone(orig):
        cloned = PyUnicode_New(1,127)
        PyUnicode_CopyCharacters(cloned, 0, orig, 0, 1)
        return cloned
    

    Noteworthy details:

    • It is not possible to use PyUnicode_1BYTE_DATA with ctypes, because it is a macro. An alternative would be to calculate the offset of the data member and access that memory directly (but this depends on the platform and doesn't feel very portable).
    • As a workaround, PyUnicode_CopyCharacters is used (there are probably other possibilities to achieve the same), which is more abstract and portable than calculating/accessing the memory directly.
    • Actually, _PyUnicode_FastCopyCharacters is used, because PyUnicode_CopyCharacters would check whether the target unicode has multiple references and raise an error. _PyUnicode_FastCopyCharacters doesn't perform those checks and does as asked.

    And now:

    a = "1"
    b = clone(a)
    a is b, a == b
    # yields (False, True)
    

    For strings longer than one character, interning is a lot easier to avoid, e.g.:

    a = "12"
    b = "123"[0:2]
    a is b, a == b
    # yields (False, True)
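
    A few more runtime constructions yield equal-but-distinct multi-character strings; a CPython-specific sketch (none of this is guaranteed by the language spec):

```python
a = "12"
b = "".join(["1", "2"])    # runtime concatenation: a fresh, un-interned object
c = b"12".decode("ascii")  # runtime decoding: also a fresh object
print(a == b == c)         # True
print(a is b, a is c)      # False False
```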