I am trying to understand the interning mechanism used in Python's implementation of the string object. But in both PyObject *PyString_FromString(const char *str)
and PyObject *PyString_FromStringAndSize(const char *str, Py_ssize_t size),
Python interns a string only when its size is 0 or 1.
PyObject *
PyString_FromString(const char *str)
{
    /* ... */
    fprintf(stdout, "creating %s\n", str);  /* [1] logging added by me */
    /* ... */
    /* share short strings */
    if (size == 0) {
        PyObject *t = (PyObject *)op;
        PyString_InternInPlace(&t);
        op = (PyStringObject *)t;
        nullstring = op;
        Py_INCREF(op);
    } else if (size == 1) {
        PyObject *t = (PyObject *)op;
        PyString_InternInPlace(&t);
        op = (PyStringObject *)t;
        characters[*str & UCHAR_MAX] = op;
        Py_INCREF(op);
    }
    return (PyObject *) op;
}
But for a longer string such as a = 'python', if I modify string_print to print the address, it is identical to the address of another string variable, b = 'python'. At the line marked [1] above, I print a log message whenever Python creates a string object. The log shows that several strings are created when executing a = 'python', but 'python' itself is not among them.
>>> a = 'python'
creating stdin
creating stdin
string and size creating (null)
string and size creating a = 'python'
creating a
string and size creating (null)
string and size creating (null)
creating __main__
string and size creating (null)
string and size creating (null)
creating <stdin>
string and size creating d
creating __lltrace__
creating stdout
[26691 refs]
creating ps1
creating ps2
So where is the string 'python' created and interned?
Update 1
Please refer to the comment by @Daniel Darabos for a better interpretation; it is a more understandable way of asking this question.
The following is the output of PyString_InternInPlace after adding a logging statement.
void
PyString_InternInPlace(PyObject **p)
{
    register PyStringObject *s = (PyStringObject *)(*p);
    fprintf(stdout, "Interning ");
    PyObject_Print((PyObject *)s, stdout, 0);
    fprintf(stdout, "\n");
    /* ... */
}
>>> x = 'python'
Interning 'cp936'
Interning 'x'
Interning 'cp936'
Interning 'x'
Interning 'python'
[26706 refs]
The string literal is turned into a string object by the compiler. The function that does that is PyString_DecodeEscape, at least in Python 2.7; you haven't said which version you are working with.
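One way to see this from Python itself, without patching the interpreter, is to compile the statement and look at the constants of the resulting code object. This is only a sketch: the file name "<demo>" is a made-up label, and the exact contents of co_consts differ between versions, so the code only checks that the literal is present.

```python
# Compile the assignment without executing it.
code = compile("a = 'python'", "<demo>", "exec")

# The literal already exists as a string object at this point,
# stored among the code object's constants.
print(code.co_consts)
has_literal = 'python' in code.co_consts
print(has_literal)
```

So the string object exists before the assignment ever runs, which would explain why the creation log for the executed statement does not mention it.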
The compiler interns some strings during compilation, but exactly when it does so is confusing. The string must contain only characters that are valid in an identifier (ASCII letters, digits, and underscores):
>>> a = 'python'
>>> b = 'python'
>>> a is b
True
>>> a = 'python!'
>>> b = 'python!'
>>> a is b
False
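As far as I can tell, in CPython 2.7 this check lives in a small C helper (all_name_chars in Objects/codeobject.c) that tests each character against a fixed set. Below is a rough Python re-creation of that test, not the actual implementation; the name looks_like_identifier is made up for illustration.

```python
import string

# Assumption: mirrors the character set CPython 2.7 checks against.
NAME_CHARS = set(string.ascii_letters + string.digits + "_")


def looks_like_identifier(s):
    """Return True if every character of s is an ASCII letter, digit, or underscore."""
    return all(c in NAME_CHARS for c in s)


print(looks_like_identifier("python"))
print(looks_like_identifier("python!"))
```

Note that this is looser than a real identifier test: a string such as '123' passes even though it is not a valid name.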
Even in functions, string literals can be interned:
>>> def f():
...     return 'python'
...
>>> def g():
...     return 'python'
...
>>> f() is g()
True
But not if they have funny characters:
>>> def f():
...     return 'python!'
...
>>> def g():
...     return 'python!'
...
>>> f() is g()
False
And if I return a pair of strings, neither of them is interned; I don't know why:
>>> def f():
...     return 'python', 'python!'
...
>>> def g():
...     return 'python', 'python!'
...
>>> a, b = f()
>>> c, d = g()
>>> a is c
False
>>> a == c
True
>>> b is d
False
>>> b == d
True
Moral of the story: interning is an implementation-dependent optimization that depends on many factors. It can be interesting to understand how it works, but never depend on it working any particular way.
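If you really do need two equal strings to be the same object, you can ask for it explicitly instead of relying on what the compiler happens to do. A minimal sketch, assuming the builtin intern on Python 2 and sys.intern on Python 3; the name intern_string is just a local alias chosen for this example.

```python
try:
    intern_string = intern  # Python 2: intern is a builtin
except NameError:
    from sys import intern as intern_string  # Python 3: moved to sys

# Build two equal strings at run time so the compiler cannot share them.
a = "".join(["python", "!"])
b = "".join(["python", "!"])

# Explicit interning returns the one canonical object for this value.
a = intern_string(a)
b = intern_string(b)
same_object = a is b
print(a == b)
print(same_object)
```

Even so, compare strings with == in ordinary code; identity is only worth relying on after an explicit intern call like this one.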