Python integer objects in the range [1, 2^30) need 28 bytes, as reported by sys.getsizeof() and explained for example in this SO-post.
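For reference, on a 64-bit CPython this can be checked directly (a PyLongObject "digit" holds 30 bits and occupies 4 bytes):

import sys

print(sys.getsizeof(1))          # 28: object header plus one 30-bit digit
print(sys.getsizeof(2**30 - 1))  # 28: still a single digit
print(sys.getsizeof(2**30))      # 32: a second digit is needed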
However, when I measure the memory footprint with the following script:
#int_list.py:
import sys

N = int(sys.argv[1])
lst = [0] * N  # no overallocation

for i in range(N):
    lst[i] = 1000 + i  # ints not from integer pool
via
/usr/bin/time -fpeak_used_memory:%M python3 int_list.py <N>
I get the following peak memory values (Linux-x64, Python 3.6.2):
  N     Peak memory in KB    bytes/integer
-------------------------------------------
  1            9220
 1e7         404712              40.50
 2e7         800612              40.52
 3e7        1196204              40.52
 4e7        1591948              40.52

where bytes/integer is computed relative to the N=1 baseline, i.e. as (peak(N) - peak(1)) * 1024 / N.
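For reproducibility, the table can be regenerated with a small driver along these lines (a sketch; it assumes GNU time at /usr/bin/time and the int_list.py script from above):

# measure.py: hypothetical helper script
import re
import subprocess

def peak_kb(n):
    # GNU time writes the format string to stderr
    res = subprocess.run(
        ["/usr/bin/time", "-f", "peak_used_memory:%M",
         "python3", "int_list.py", str(n)],
        stderr=subprocess.PIPE, check=True)
    return int(re.search(rb"peak_used_memory:(\d+)", res.stderr).group(1))

baseline = peak_kb(1)  # 9220 KB in the table above
for n in (10**7, 2 * 10**7, 3 * 10**7, 4 * 10**7):
    kb = peak_kb(n)
    print(n, kb, round((kb - baseline) * 1024 / n, 2))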
So it looks as if 40.5 bytes are needed per integer object, i.e. 12.5 bytes more than sys.getsizeof() yields.
An additional 8 bytes are easy to explain: the list lst doesn't hold the integer objects themselves, but references to them, so an additional pointer, i.e. 8 bytes, is needed per element.
However, what are the other 4.5 bytes used for?
The following causes can be ruled out:

- The size of the integers: 10^7 is smaller than 2^30, so all integers are 28 bytes large.
- Overallocation of the list lst: this can be easily checked via sys.getsizeof(lst), which yields 8 times the number of elements plus a very small overhead (see the quick check below).
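A quick check of the second point (the constant 64 is the weight of an empty list on my 64-bit build):

import sys

N = 10**7
lst = [0] * N
print(sys.getsizeof(lst))          # 80000064, i.e. 8*N + 64: one pointer per slot
print(sys.getsizeof(lst) - 8 * N)  # 64: the small constant overhead of the list object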
@Nathan's suggestion is surprisingly not the solution, due to some subtle details of CPython's longint implementation. With his explanation, the memory footprint for
...
lst[i] = (1<<30)+i
should still be 40.52, because sys.getsizeof(1<<30) is 32, but the measurements show it to be 48.56. On the other hand, for
...
lst[i] = (1<<60)+i
the footprint is still 48.56, despite the fact that sys.getsizeof(1<<60) is 36.
The reason: sys.getsizeof() doesn't tell us the real memory footprint for the result of a summation a+b, which is

- 32 bytes for 1000+i
- 36 bytes for (1<<30)+i
- 40 bytes for (1<<60)+i

That happens because when two integers are added in x_add, the resulting integer at first has one "digit", i.e. 4 bytes, more than the maximum of a and b:
static PyLongObject *
x_add(PyLongObject *a, PyLongObject *b)
{
    Py_ssize_t size_a = Py_ABS(Py_SIZE(a)), size_b = Py_ABS(Py_SIZE(b));
    PyLongObject *z;
    ...
    /* Ensure a is the larger of the two: */
    ...
    z = _PyLong_New(size_a+1);
    ...
after the addition the result is normalized:
    ...
    return long_normalize(z);
}
i.e. possible leading zeros are discarded, but the memory isn't released (4 bytes aren't worth the trouble); the source of the function can be found here.
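This spare digit is invisible from within Python, because getsizeof() reports the normalized number of digits, not the size of the allocated block; a quick demonstration (values for a 64-bit CPython 3.6):

import sys

x = 1 << 30
a = x + 1                # goes through x_add: three digits, i.e. 36 bytes, are allocated...
print(sys.getsizeof(a))  # ...but 32 is reported, the result being normalized to two digits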
Now we can use @Nathan's insight to explain why the footprint of (1<<30)+i is 48.56 and not 44.xy: the pymalloc allocator uses memory blocks with an alignment of 8 bytes, which means the 36 bytes will be stored in a block of size 40, the same as for the result of (1<<60)+i (keep the additional 8 bytes for the pointers in mind).
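The rounding rule is easy to mimic (this is not CPython code, just the same arithmetic that pymalloc's 8-byte size classes implement):

def block_size(nbytes, alignment=8):
    # round the request up to the next multiple of the alignment
    return (nbytes + alignment - 1) & ~(alignment - 1)

print(block_size(32))  # 32: block for 1000+i
print(block_size(36))  # 40: block for (1<<30)+i
print(block_size(40))  # 40: block for (1<<60)+i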
To explain the remaining 0.5 bytes we need to dive deeper into the details of the pymalloc allocator. A good overview is the source code itself; my last attempt to describe it can be found in this SO-post.

In a nutshell, the allocator manages memory in arenas of 256KB each. When an arena is allocated, the memory is reserved but not committed; we see memory as "used" only when a so-called pool is touched. A pool is 4KB big (POOL_SIZE) and is used only for memory blocks of the same size class, in our case 32 bytes. That means the resolution of peak_used_memory is 4KB and cannot be responsible for those 0.5 bytes.
However, these pools must be managed, and this leads to an additional overhead: pymalloc needs one pool_header per pool:
/* Pool for small blocks. */
struct pool_header {
    union { block *_padding;
            uint count; } ref;      /* number of allocated blocks    */
    block *freeblock;               /* pool's free list head         */
    struct pool_header *nextpool;   /* next pool of this size class  */
    struct pool_header *prevpool;   /* previous pool ""              */
    uint arenaindex;                /* index into arenas of base adr */
    uint szidx;                     /* block size class index        */
    uint nextoffset;                /* bytes to virgin block         */
    uint maxnextoffset;             /* largest valid nextoffset      */
};
The size of this struct is 48 bytes (called POOL_OVERHEAD) on my Linux-x64 machine. This pool_header is part of the pool itself (a quite smart way to avoid an additional allocation via the C-runtime memory allocator) and takes the place of two 32-byte blocks, which means a pool has room for only 126 32-byte integers:
/* Return total number of blocks in pool of size index I, as a uint. */
#define NUMBLOCKS(I) ((uint)(POOL_SIZE - POOL_OVERHEAD) / INDEX2SIZE(I))
Which leads to:

- 4KB/126 = 32.51 bytes of footprint for 1000+i, plus an additional 8 bytes for the pointer.
- (1<<30)+i needs 40 bytes, so 4KB has room for 102 blocks, of which one is used for the pool_header (when the pool is divided into 40-byte blocks, 16 bytes remain, and they can be used for the pool_header as well), which leads to 4KB/101 = 40.55 bytes (plus the 8-byte pointer).

We can also see that there is some additional overhead, responsible for ca. 0.01 bytes per integer, but it is not big enough for me to care.
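The whole accounting condenses to a few lines (a sketch using the constants from above, i.e. POOL_SIZE = 4096 and POOL_OVERHEAD = 48 on my Linux-x64 machine, plus 8 bytes for the list's reference):

POOL_SIZE = 4096     # 4KB pool
POOL_OVERHEAD = 48   # sizeof(struct pool_header) on Linux-x64
POINTER = 8          # the reference stored in the list

def bytes_per_integer(block_size):
    nblocks = (POOL_SIZE - POOL_OVERHEAD) // block_size  # cf. NUMBLOCKS
    return POOL_SIZE / nblocks + POINTER

print(round(bytes_per_integer(32), 2))  # 40.51, vs. measured 40.50...40.52 for 1000+i
print(round(bytes_per_integer(40), 2))  # 48.55, vs. measured 48.56 for (1<<30)+i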