
Interning and memory address for str and Py_UNICODE in cython


Context: I built a Tree data structure that stores single characters in its Nodes in Cython. Now I'm wondering whether I can save memory if I intern all those characters, and whether I should use Py_UNICODE as the variable type or regular str. This is my stripped-down Node object, using Py_UNICODE:

%%cython
from libc.stdint cimport uintptr_t
from cpython cimport PyObject

cdef class Node():
    cdef:
        public Py_UNICODE character

    def __init__(self, Py_UNICODE character):
        self.character = character

    def memory(self):
        return <uintptr_t>&self.character

I first tried to see whether the characters are interned automatically. If I import that class in Python and create multiple objects with different or the same character, these are the results I get:

a = Node("a")
a_py = a.character
a2 = Node("a")
b = Node("b")

print(a.memory(), a2.memory(), b.memory())
# 140532544296704 140532548558776 140532544296488

print(id(a.character), id(a2.character), id(b.character), id(a_py))
# 140532923573504 140532923573504 140532923840528 140532923573504

So from that I would have concluded that Py_UNICODE is not interned automatically, and that using id() in Python does not give me the actual memory address but that of a copy (I suppose Python does intern single unicode characters automatically and then just returns the memory address of that copy to me).

Next I tried doing the same thing when using str instead. Simply replacing Py_UNICODE with str didn't work, so this is how I'm trying to do it now:

%%cython
from libc.stdint cimport uintptr_t
from cpython cimport PyObject

cdef class Node():
    cdef:
        public str character

    def __init__(self, str character):
        self.character = character

    def memory(self):
        return <uintptr_t>(<PyObject*>self.character)

And these are the results I get with that:

...
print(a.memory(), a2.memory(), b.memory())
# 140532923573504 140532923573504 140532923840528

print(id(a.character), id(a2.character), id(b.character), id(a_py))
# 140532923573504 140532923573504 140532923840528 140532923573504

Based on that, I first thought that single-character str are interned in Cython as well and that Cython just doesn't need to copy the characters from Python, explaining why id() and .memory() give the same address. But then I tried longer strings and got the same results, and I probably don't want to conclude that longer strings are also interned automatically? It's also the case that my tree uses less memory if I use Py_UNICODE, which wouldn't make much sense if str were interned but Py_UNICODE weren't. Can someone explain this behaviour to me? And how would I go about interning?

(I'm testing this in Jupyter, in case that makes a difference)

Edit: Removed an unnecessary id comparison of nodes instead of characters.


Solution

  • There is a misunderstanding on your side: Py_UNICODE isn't a Python object - it is a typedef for wchar_t.

    Only string objects (or at least some of them) get interned, not simple C variables of type wchar_t (or, as a matter of fact, of any C type). It also wouldn't make any sense: a wchar_t is most probably 32 bits large, while keeping a pointer to an interned object would cost 64 bits.
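    One can check these sizes directly - a quick sketch (the exact numbers are platform-dependent; typically wchar_t is 4 bytes on Linux, 2 on Windows, and a pointer is 8 bytes on 64-bit systems):

    %%cython
    from libc.stddef cimport wchar_t

    def sizes():
        # size of a raw character variable vs. size of the pointer
        # that would be needed to reference an interned object
        return sizeof(wchar_t), sizeof(void*)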

    Thus, the memory address of the variable self.character (of type Py_UNICODE) is never the same for different Node objects (no matter which value self.character has).

    On the other hand, when you access a.character from pure Python, Cython knows that the variable is not just a simple 32-bit integer and converts it automatically (character is a property, right?) to a unicode object via PyUnicode_FromOrdinal. The returned string (i.e. a_py) might be "interned" or not.
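    For illustration, this is roughly what that implicit conversion looks like when written out by hand (a sketch; PyUnicode_FromOrdinal is a CPython C-API function, exposed through Cython's cpython.unicode):

    %%cython
    from cpython.unicode cimport PyUnicode_FromOrdinal

    def to_str(Py_UCS4 ch):
        # what Cython does behind the scenes when a Py_UNICODE/Py_UCS4
        # attribute is read from Python code
        return PyUnicode_FromOrdinal(<int>ch)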

    When the code point of such a character is less than 256 (i.e. latin1), it gets kind of interned - otherwise not. The first 256 unicode objects consisting of only one character have a special place, not the same as other interned strings (thus the quotes around "interned" in the previous paragraph).

    Consider:

    >>> a="\u00ff" # ord(a) =  255
    >>> b="\u00ff"
    >>> a is b
    # True
    

    but

    >>> a="\u0100" # ord(a) =  256
    >>> b="\u0100"
    >>> a is b
    # False
    
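    For completeness: strings beyond the latin1 range can still be interned explicitly with sys.intern, which also answers the "how would I go about interning" part of the question:

    >>> import sys
    >>> a = sys.intern("\u0100")
    >>> b = sys.intern("\u0100")
    >>> a is b
    # True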

    The key takeaway from this is: use Py_UNICODE - even when not interned it is cheaper (4 bytes) than interned strings/unicode objects (8 bytes for the reference + the one-time memory for the interned object itself) and much cheaper than non-interned objects (which can happen).

    Or better, as @user2357112 has pointed out, use Py_UCS4 to ensure that a size of 4 bytes is guaranteed (which is needed to be able to support all possible unicode characters) - wchar_t could be as small as 1 byte (even if that is probably pretty unusual nowadays). If you know more about the characters in use, you could fall back to Py_UCS2 or Py_UCS1.
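    A minimal sketch of the Node from the question using Py_UCS4 - Cython converts between Py_UCS4 and single-character unicode objects automatically, so nothing else is needed:

    %%cython
    cdef class Node:
        cdef public Py_UCS4 character  # always exactly 4 bytes, covers all code points

        def __init__(self, Py_UCS4 character):
            self.character = character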


    However, when using Py_UCS2 or Py_UCS1 one must take into consideration that Cython will not support the conversion from/to unicode as in the case of Py_UCS4 (or the deprecated Py_UNICODE), and one will have to do it by hand, for example:

    %%cython 
    from libc.stdint cimport uint16_t
    
    # need to wrap typedef as Cython doesn't do it
    cdef extern from "Python.h":
        ctypedef uint16_t Py_UCS2
    
    cdef class Node:
        cdef:
            Py_UCS2 character_
    
        @property
        def character(self):
            # Cython will do the right thing for Py_UCS4
            return <Py_UCS4>(self.character_) 
    
        def __init__(self, str character):
            # unicode -> Py_UCS4 managed by Cython
            # Py_UCS4 -> Py_UCS2 is a simple C-cast
            self.character_ = <Py_UCS2><Py_UCS4>(character)
    
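    Usage from Python then looks the same as before - the property converts the stored value back to a one-character string:

    node = Node("ä")       # ord("ä") = 228, fits into Py_UCS2
    print(node.character)  # ä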

    One should also make sure that using Py_UCS2 really saves memory: CPython uses pymalloc, which has an alignment of 8 bytes, so an object of e.g. 20 bytes will still use 24 bytes (3*8) of memory. Another issue is the alignment of structs produced by the C compiler; for

    struct A{
        long long int a;
        long long int b;
        char ch;
    };
    

    sizeof(A) is 24 instead of 17, because the compiler pads the struct to satisfy alignment requirements.

    If one is really after these two bytes, then there is a bigger fish to fry: don't make the nodes Python objects, as that brings 16 bytes of overhead for not really needed polymorphism and reference counting - the whole data structure should instead be written in C and wrapped as a whole in Python. However, also here make sure to allocate memory in the right fashion: the usual C-runtime memory allocators have 32 or 64 byte alignment, i.e. allocating smaller sizes still leads to the usage of 32/64 bytes.
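    To illustrate that direction, here is a rough sketch (hypothetical layout, assuming a first-child/next-sibling tree; a real implementation would also need insertion and search):

    %%cython
    from libc.stdlib cimport calloc, free

    cdef struct CNode:
        Py_UCS4 character
        CNode* first_child
        CNode* next_sibling

    cdef CNode* new_node(Py_UCS4 ch) except NULL:
        # calloc zero-initializes the child/sibling pointers
        cdef CNode* node = <CNode*>calloc(1, sizeof(CNode))
        if node == NULL:
            raise MemoryError()
        node.character = ch
        return node

    cdef void free_tree(CNode* node):
        if node == NULL:
            return
        free_tree(node.first_child)
        free_tree(node.next_sibling)
        free(node)

    cdef class Tree:
        # the only Python object - it wraps the whole C data structure
        cdef CNode* root

        def __cinit__(self):
            self.root = new_node(0)

        def __dealloc__(self):
            free_tree(self.root)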