Context: I built a Tree data structure in Cython that stores single characters in its Nodes. Now I'm wondering whether I can save memory if I intern all those characters, and whether I should use Py_UNICODE as the variable type or regular str. This is my stripped-down Node object, using Py_UNICODE:
%%cython
from libc.stdint cimport uintptr_t
from cpython cimport PyObject

cdef class Node():
    cdef:
        public Py_UNICODE character
    def __init__(self, Py_UNICODE character):
        self.character = character
    def memory(self):
        return <uintptr_t>&self.character
I first tried to see whether the characters are interned automatically. If I import that class in Python and create multiple objects with the same or different characters, these are the results I get:
a = Node("a")
a_py = a.character
a2 = Node("a")
b = Node("b")
print(a.memory(), a2.memory(), b.memory())
# 140532544296704 140532548558776 140532544296488
print(id(a.character), id(a2.character), id(b.character), id(a_py))
# 140532923573504 140532923573504 140532923840528 140532923573504
So from that I would have concluded that Py_UNICODE is not interned automatically and that using id() in Python does not give me the actual memory address, but that of a copy (I suppose Python does intern single unicode characters automatically and then just returns the memory address of that cached copy to me).
Next I tried doing the same thing when using str instead. Simply replacing Py_UNICODE with str didn't work, so this is how I'm trying to do it now:
%%cython
from libc.stdint cimport uintptr_t
from cpython cimport PyObject

cdef class Node():
    cdef:
        public str character
    def __init__(self, str character):
        self.character = character
    def memory(self):
        return <uintptr_t>(<PyObject*>self.character)
And these are the results I get with that:
...
print(a.memory(), a2.memory(), b.memory())
# 140532923573504 140532923573504 140532923840528
print(id(a.character), id(a2.character), id(b.character), id(a_py))
# 140532923573504 140532923573504 140532923840528 140532923573504
Based on that, I first thought that single-character str are interned in Cython as well, and that Cython just doesn't need to copy the characters from Python, which would explain why id() and .memory() give the same address. But then I tried longer strings and got the same results, and I probably don't want to conclude from that that longer strings are also interned automatically. It's also the case that my tree uses less memory if I use Py_UNICODE, which doesn't make much sense if str were interned but Py_UNICODE weren't. Can someone explain this behaviour to me? And how would I go about interning?
(I'm testing this in Jupyter, in case that makes a difference)
Edit: Removed an unnecessary id comparison of nodes instead of characters.
There is a misunderstanding on your side: Py_UNICODE isn't a Python object - it is a typedef for wchar_t.

Only string objects (at least some of them) get interned, but not simple C variables of type wchar_t (or, as a matter of fact, of any C type). It also wouldn't make any sense: a wchar_t is most probably 32 bits large, while keeping a pointer to an interned object would cost 64 bits.

Thus, the memory address of the variable self.character (of type Py_UNICODE) is never the same as long as the self objects are different (no matter which value self.character has).
On the other hand, when you access a.character in pure Python, Cython knows that the variable is not a simple 32-bit integer and converts it automatically (character is a property, right?) to a unicode object via PyUnicode_FromOrdinal. The returned string (i.e. a_py) might be "interned" or not.
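One way to make this conversion visible is a hypothetical session reusing the Py_UNICODE version of Node from the question (the identity results are CPython implementation details, not guarantees):

n = Node("a")        # code point 97 < 256: the conversion returns a cached 1-char string
print(n.character is n.character)   # True

m = Node("\u0100")   # code point 256: a fresh str is built on every access
print(m.character is m.character)   # False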
When the code point of the character is less than 256 (i.e. Latin-1), it gets kind of interned - otherwise not. The first 256 unicode objects consisting of only one character have a special place, not the same as other interned strings (thus the quotation marks around "interned" in the previous paragraph).
Consider:
>>> a="\u00ff" # ord(a) = 255
>>> b="\u00ff"
>>> a is b
# True
but
>>> a="\u0100" # ord(a) = 256
>>> b="\u0100"
>>> a is b
# False
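As for the "how would I go about interning" part of the question: for actual str objects, explicit interning via sys.intern (a standard-library function, nothing Cython-specific) makes the identity sharing work for code points >= 256 as well:

>>> import sys
>>> a = sys.intern("\u0100")
>>> b = sys.intern("\u0100")
>>> a is b
True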
The key takeaway from this is: use Py_UNICODE - even if not interned, it is cheaper (4 bytes) than interned strings/unicode objects (8 bytes for the reference + once the memory for the interned object itself) and much cheaper than non-interned objects (which can happen).
Or better, as @user2357112 has pointed out, use Py_UCS4 to ensure that the size of 4 bytes is guaranteed (which is needed to be able to support all possible unicode characters) - wchar_t could be as small as 1 byte (even if this is probably pretty unusual nowadays). If you know more about the characters used, you could fall back to Py_UCS2 or Py_UCS1.
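If in doubt about the sizes on a given platform, one can simply let the compiler report them - a minimal check for a Jupyter %%cython cell (Py_UNICODE and Py_UCS4 are built into Cython, so nothing needs to be imported):

%%cython
# 4 bytes per stored character vs. 8 bytes for a mere pointer on a 64-bit build
print(sizeof(Py_UNICODE), sizeof(Py_UCS4), sizeof(void*))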
However, when using Py_UCS2 or Py_UCS1, one must take into consideration that Cython will not support the conversion from/to unicode as in the case of Py_UCS4 (or the deprecated Py_UNICODE), and one will have to do it by hand, for example:
%%cython
from libc.stdint cimport uint16_t

# need to wrap the typedef, as Cython doesn't provide it
cdef extern from "Python.h":
    ctypedef uint16_t Py_UCS2

cdef class Node:
    cdef:
        Py_UCS2 character_

    @property
    def character(self):
        # Cython will do the right thing for Py_UCS4
        return <Py_UCS4>(self.character_)

    def __init__(self, str character):
        # unicode -> Py_UCS4 is managed by Cython
        # Py_UCS4 -> Py_UCS2 is a simple C-cast
        self.character_ = <Py_UCS2><Py_UCS4>(character)
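Assuming the cell above compiles, a quick round trip could look like this (hypothetical session):

n = Node("ä")              # code point 228 fits into Py_UCS2
print(n.character)         # ä
print(type(n.character))   # <class 'str'>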
One should also make sure that using Py_UCS2 really saves memory: CPython uses pymalloc, which has an alignment of 8 bytes; that means an object of e.g. 20 bytes will still use 24 bytes (3*8) of memory. Another issue is the alignment of structs produced by the C compiler; for
struct A {
    long long int a;
    long long int b;
    char ch;
};
sizeof(A) is 24 instead of 17, because the compiler pads the struct up to the alignment of its largest member.
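Because of this rounding, it is worth measuring rather than guessing. As a sketch (NodeUCS4 and NodeUCS2 are hypothetical variants of the class above with a Py_UCS4 and a Py_UCS2 field respectively), sys.getsizeof reports the instance size and may well show no difference, since the C struct behind a cdef class is itself padded to 8-byte multiples:

import sys
n4 = NodeUCS4("a")   # hypothetical variant with a Py_UCS4 field
n2 = NodeUCS2("a")   # hypothetical variant with a Py_UCS2 field
print(sys.getsizeof(n4), sys.getsizeof(n2))
# may print identical numbers, e.g. 24 24, due to struct padding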
If one is really after these two bytes, then there is a bigger fish to fry: don't make the nodes Python objects, as that brings an overhead of 16 bytes per node for not really needed polymorphism and reference counting - which means the whole data structure should be written in C and wrapped as a whole in Python. However, here too, make sure to allocate memory in the right fashion: the usual C-runtime memory allocators have 32- or 64-byte alignment, i.e. allocating smaller sizes still leads to the usage of 32/64 bytes.
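A minimal sketch of that last idea, assuming a pool-backed design (the names CNode, Tree, add_node and the index-based links are illustrative, not from the question): the nodes live as plain C structs inside one large allocation, and only the wrapper is a Python object.

%%cython
from libc.stdlib cimport malloc, free

cdef struct CNode:
    Py_UCS4 character
    int first_child      # index into the pool, -1 if none
    int next_sibling     # index into the pool, -1 if none

cdef class Tree:
    cdef CNode* pool     # one big allocation instead of one Python object per node
    cdef int size, capacity

    def __cinit__(self, int capacity=1024):
        self.pool = <CNode*>malloc(capacity * sizeof(CNode))
        if self.pool == NULL:
            raise MemoryError()
        self.size = 0
        self.capacity = capacity

    def __dealloc__(self):
        free(self.pool)

    def add_node(self, str character):
        # a real implementation would grow the pool instead of failing
        if self.size == self.capacity:
            raise MemoryError("node pool is full")
        # unicode -> Py_UCS4 is handled by Cython, as above
        self.pool[self.size].character = <Py_UCS4>character
        self.pool[self.size].first_child = -1
        self.pool[self.size].next_sibling = -1
        self.size += 1
        return self.size - 1    # the index serves as the node reference

This way a node costs sizeof(CNode) - 12 bytes here - instead of a full Python object, and a single malloc serves the whole pool, so the allocator's 32/64-byte granularity is paid only once.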