python, python-c-api, cpython

Why and where does Python intern strings when executing `a = 'python'`, when the source code does not show it?


I am trying to learn the interning mechanism that Python uses in the implementation of its string object. But in both PyObject *PyString_FromString(const char *str) and PyObject *PyString_FromStringAndSize(const char *str, Py_ssize_t size), Python interns a string only when its size is 0 or 1.

PyObject *
PyString_FromString(const char *str)
{
    fprintf(stdout, "creating %s\n", str);  /* [1] */
    //...
    //creating...
    /* share short strings */
    if (size == 0) {
        PyObject *t = (PyObject *)op;
        PyString_InternInPlace(&t);
        op = (PyStringObject *)t;
        nullstring = op;
        Py_INCREF(op);
    } else if (size == 1) {
        PyObject *t = (PyObject *)op;
        PyString_InternInPlace(&t);
        op = (PyStringObject *)t;
        characters[*str & UCHAR_MAX] = op;
        Py_INCREF(op);
    }
    return (PyObject *) op;
}

But for longer strings such as a = 'python', when I modify string_print to print the address, it is identical to that of another string variable b = 'python'. And at the line marked [1] above, I print a log message whenever Python creates a string object. It shows that multiple strings are created when executing a = 'python', but 'python' is not among them.

>>> a = 'python'
creating stdin
creating stdin
string and size creating (null)
string and size creating a = 'python'
?
creating a
string and size creating (null)
string and size creating (null)
creating __main__
string and size creating (null)
string and size creating (null)
creating <stdin>
string and size creating d
creating __lltrace__
creating stdout
[26691 refs]
creating ps1
creating ps2

So where is string 'python' created and interned?

Update 1

Please refer to the comment by @Daniel Darabos for a better interpretation. It is a more understandable way to ask this question.

The following is the output of PyString_InternInPlace after adding a logging statement.

void
PyString_InternInPlace(PyObject **p)
{
    register PyStringObject *s = (PyStringObject *)(*p);
    /* Log every object that is passed in for interning. */
    fprintf(stdout, "Interning ");
    PyObject_Print((PyObject *)s, stdout, 0);
    fprintf(stdout, "\n");
    //...
}

>>> x = 'python'
Interning 'cp936'
Interning 'x'
Interning 'cp936'
Interning 'x'
Interning 'python'
[26706 refs]

Solution

  • The string literal is turned into a string object by the compiler. The function that does this is PyString_DecodeEscape, at least in Python 2.7; you haven't said which version you are working with.

    Update:

    The compiler interns some strings during compilation, but it is confusing to work out when that happens. The string must contain only characters that are valid in identifiers:

    >>> a = 'python'
    >>> b = 'python'
    >>> a is b
    True
    >>> a = 'python!'
    >>> b = 'python!'
    >>> a is b
    False
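
As a rough illustration of the "identifier-ok characters" rule, here is a minimal sketch in Python. The helper name `has_only_name_chars` and the exact character set (ASCII letters, digits, underscore) are my assumptions for illustration; I believe the corresponding check in CPython 2.7 is `all_name_chars` in Objects/codeobject.c, but treat that as unverified. The sketch only predicts which literals are candidates for interning; it does not intern anything.

```python
import string

# Assumed set of "identifier-ok" characters: ASCII letters, digits, underscore.
NAME_CHARS = set(string.ascii_letters + string.digits + "_")


def has_only_name_chars(literal):
    """Hypothetical helper: True if every character could appear in an identifier."""
    return all(ch in NAME_CHARS for ch in literal)


# Print the prediction for a few literals, including the two from the session above.
for literal in ["python", "python!", "hello_world", "hello world"]:
    print(repr(literal), has_only_name_chars(literal))
```

Under this rule 'python' qualifies and 'python!' does not, which matches the `is` results above.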
    

    Even in functions, string literals can be interned:

    >>> def f():
    ...   return 'python'
    ...
    >>> def g():
    ...   return 'python'
    ...
    >>> f() is g()
    True
    

    But not if they have funny characters:

    >>> def f():
    ...   return 'python!'
    ...
    >>> def g():
    ...   return 'python!'
    ...
    >>> f() is g()
    False
    

    And if I return a pair of strings, neither of them is interned; I don't know why:

    >>> def f():
    ...   return 'python', 'python!'
    ...
    >>> def g():
    ...   return 'python', 'python!'
    ...
    >>> a, b = f()
    >>> c, d = g()
    >>> a is c
    False
    >>> a == c
    True
    >>> b is d
    False
    >>> b == d
    True
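
One way to dig into a case like this is to look at the constants the compiler stored for the function. The sketch below uses only `compile()` and the `co_consts` attribute of code objects; the helper `iter_string_consts` is a hypothetical name of mine. I have not confirmed the cause of the behaviour above, and the layout of `co_consts` differs between Python versions, so this is a starting point for inspection rather than an explanation.

```python
def iter_string_consts(consts):
    """Hypothetical helper: yield every string in a co_consts tuple, including nested tuples."""
    for const in consts:
        if isinstance(const, str):
            yield const
        elif isinstance(const, tuple):
            for inner in iter_string_consts(const):
                yield inner


source = "def f():\n    return 'python', 'python!'\n"
module_code = compile(source, "<demo>", "exec")

# The body of f is itself a code object stored among the module's constants.
function_codes = [c for c in module_code.co_consts if hasattr(c, "co_consts")]

for code in function_codes:
    # Show the raw constants, then the strings reachable from them.
    print(code.co_name, code.co_consts)
    print(list(iter_string_consts(code.co_consts)))
```

Whether the two strings appear as separate constants or only inside a tuple constant is the detail worth checking on the interpreter you are using.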
    

    Moral of the story: interning is an implementation-dependent optimization that depends on many factors. It can be interesting to understand how it works, but never depend on it working any particular way.
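
If code really needs a single shared object for equal strings, it can ask for interning explicitly instead of relying on what the compiler happens to do. The sketch below is a minimal example under that assumption; `build` and `intern_string` are names I made up for illustration. It uses the `intern()` builtin on Python 2 and `sys.intern()` on Python 3.

```python
import sys

try:
    intern_string = sys.intern  # Python 3
except AttributeError:
    intern_string = intern  # Python 2 builtin


def build(parts):
    """Build a string at run time, so the compiler never sees the result as a literal."""
    return "".join(parts)


a = build(["pyt", "hon!"])
b = build(["pyt", "hon!"])

# Equality compares contents and is what normal code should use.
print(a == b)

# Explicit interning returns one canonical object for equal strings.
ia = intern_string(a)
ib = intern_string(b)
print(ia is ib)
```

For ordinary comparisons `==` is the right tool; `is` only makes sense after both sides have been interned explicitly.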