Python integer objects in the range [1, 2^30) need 28 bytes, as reported by sys.getsizeof() and explained for example in this SO-post.
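For reference, on a 64-bit CPython this can be checked directly (a PyLongObject "digit" holds 30 bits and occupies 4 bytes):

import sys

print(sys.getsizeof(1))          # 28: object header plus one 30-bit digit
print(sys.getsizeof(2**30 - 1))  # 28: still a single digit
print(sys.getsizeof(2**30))      # 32: a second digit is needed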
However, when I measure the memory footprint with the following script:
#int_list.py:
import sys

N = int(sys.argv[1])
lst = [0] * N  # no overallocation

for i in range(N):
    lst[i] = 1000 + i  # ints not from integer pool
via
/usr/bin/time -fpeak_used_memory:%M python3 int_list.py <N>
I get the following peak memory values (Linux-x64, Python 3.6.2):
  N     Peak memory in KB    bytes/integer
-------------------------------------------
  1            9220
 1e7         404712              40.50
 2e7         800612              40.52
 3e7        1196204              40.52
 4e7        1591948              40.52

where bytes/integer is computed relative to the N=1 baseline, i.e. as (peak(N) - peak(1)) * 1024 / N.
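For reproducibility, the table can be regenerated with a small driver along these lines (a sketch; it assumes GNU time at /usr/bin/time and the int_list.py script from above):

# measure.py: hypothetical helper script
import re
import subprocess

def peak_kb(n):
    # GNU time writes the format string to stderr
    res = subprocess.run(
        ["/usr/bin/time", "-f", "peak_used_memory:%M",
         "python3", "int_list.py", str(n)],
        stderr=subprocess.PIPE, check=True)
    return int(re.search(rb"peak_used_memory:(\d+)", res.stderr).group(1))

baseline = peak_kb(1)  # 9220 KB in the table above
for n in (10**7, 2 * 10**7, 3 * 10**7, 4 * 10**7):
    kb = peak_kb(n)
    print(n, kb, round((kb - baseline) * 1024 / n, 2))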
So it looks as if 40.5 bytes are needed per integer object, i.e. 12.5 bytes more than sys.getsizeof() yields.
An additional 8 bytes are easy to explain: the list lst doesn't hold the integer objects themselves, but references to them, so an additional pointer, i.e. 8 bytes, is needed per element.
However, what are the other 4.5 bytes used for?
The following causes can be ruled out:

- The size of the integers: 10^7 is smaller than 2^30, so all integers are 28 bytes large.
- Overallocation of the list lst: this can be easily checked via sys.getsizeof(lst), which yields 8 times the number of elements plus a very small overhead (see the quick check below).
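A quick check of the second point (the constant 64 is the weight of an empty list on my 64-bit build):

import sys

N = 10**7
lst = [0] * N
print(sys.getsizeof(lst))          # 80000064, i.e. 8*N + 64: one pointer per slot
print(sys.getsizeof(lst) - 8 * N)  # 64: the small constant overhead of the list object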
@Nathan's suggestion is surprisingly not the solution, due to some subtle details of CPython's longint implementation. With his explanation, the memory footprint for
...
lst[i] = (1<<30)+i
should still be 40.52, because sys.getsizeof(1<<30) is 32, but the measurements show it to be 48.56. On the other hand, for
...
lst[i] = (1<<60)+i
the footprint is still 48.56, despite the fact that sys.getsizeof(1<<60) is 36.
The reason: sys.getsizeof() doesn't tell us the real memory footprint for the result of a summation a+b, which is

- 32 bytes for 1000+i
- 36 bytes for (1<<30)+i
- 40 bytes for (1<<60)+i

That happens because when two integers are added in x_add, the resulting integer at first has one "digit", i.e. 4 bytes, more than the maximum of a and b:
static PyLongObject *
x_add(PyLongObject *a, PyLongObject *b)
{
    Py_ssize_t size_a = Py_ABS(Py_SIZE(a)), size_b = Py_ABS(Py_SIZE(b));
    PyLongObject *z;
    ...
    /* Ensure a is the larger of the two: */
    ...
    z = _PyLong_New(size_a+1);
    ...
after the addition the result is normalized:
    ...
    return long_normalize(z);
}
i.e. possible leading zeros are discarded, but the memory isn't released (4 bytes aren't worth the trouble); the source of the function can be found here.
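This spare digit is invisible from within Python, because getsizeof() reports the normalized number of digits, not the size of the allocated block; a quick demonstration (values for a 64-bit CPython 3.6):

import sys

x = 1 << 30
a = x + 1                # goes through x_add: three digits, i.e. 36 bytes, are allocated...
print(sys.getsizeof(a))  # ...but 32 is reported, the result being normalized to two digits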
Now we can use @Nathan's insight to explain why the footprint of (1<<30)+i is 48.56 and not 44.xy: the pymalloc allocator uses memory blocks with an alignment of 8 bytes, which means the 36 bytes will be stored in a block of size 40, the same as for the result of (1<<60)+i (keep the additional 8 bytes for the pointers in mind).
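The rounding rule is easy to mimic (this is not CPython code, just the same arithmetic that pymalloc's 8-byte size classes implement):

def block_size(nbytes, alignment=8):
    # round the request up to the next multiple of the alignment
    return (nbytes + alignment - 1) & ~(alignment - 1)

print(block_size(32))  # 32: block for 1000+i
print(block_size(36))  # 40: block for (1<<30)+i
print(block_size(40))  # 40: block for (1<<60)+i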
To explain the remaining 0.5 bytes we need to dive deeper into the details of the pymalloc allocator. A good overview is the source code itself; my last attempt to describe it can be found in this SO-post.

In a nutshell, the allocator manages memory in arenas of 256KB each. When an arena is allocated, the memory is reserved but not committed; we see memory as "used" only when a so-called pool is touched. A pool is 4KB big (POOL_SIZE) and is used only for memory blocks of the same size class, in our case 32 bytes. That means the resolution of peak_used_memory is 4KB and cannot be responsible for those 0.5 bytes.
However, these pools must be managed, and this leads to an additional overhead: pymalloc needs one pool_header per pool:
/* Pool for small blocks. */
struct pool_header {
    union { block *_padding;
            uint count; } ref;      /* number of allocated blocks    */
    block *freeblock;               /* pool's free list head         */
    struct pool_header *nextpool;   /* next pool of this size class  */
    struct pool_header *prevpool;   /* previous pool ""              */
    uint arenaindex;                /* index into arenas of base adr */
    uint szidx;                     /* block size class index        */
    uint nextoffset;                /* bytes to virgin block         */
    uint maxnextoffset;             /* largest valid nextoffset      */
};
The size of this struct is 48 bytes (called POOL_OVERHEAD) on my Linux-x64 machine. This pool_header is part of the pool itself (a quite smart way to avoid an additional allocation via the C-runtime memory allocator) and takes the place of two 32-byte blocks, which means a pool has room for only 126 32-byte integers:
/* Return total number of blocks in pool of size index I, as a uint. */
#define NUMBLOCKS(I) ((uint)(POOL_SIZE - POOL_OVERHEAD) / INDEX2SIZE(I))
Which leads to:

- 4KB/126 = 32.51 bytes of footprint for 1000+i, plus an additional 8 bytes for the pointer.
- (1<<30)+i needs 40 bytes, so 4KB has room for 102 blocks, of which one is used for the pool_header (when the pool is divided into 40-byte blocks, 16 bytes remain, and they can be used for the pool_header as well), which leads to 4KB/101 = 40.55 bytes (plus the 8-byte pointer).

We can also see that there is some additional overhead, responsible for ca. 0.01 bytes per integer, but it is not big enough for me to care.
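The whole accounting condenses to a few lines (a sketch using the constants from above, i.e. POOL_SIZE = 4096 and POOL_OVERHEAD = 48 on my Linux-x64 machine, plus 8 bytes for the list's reference):

POOL_SIZE = 4096     # 4KB pool
POOL_OVERHEAD = 48   # sizeof(struct pool_header) on Linux-x64
POINTER = 8          # the reference stored in the list

def bytes_per_integer(block_size):
    nblocks = (POOL_SIZE - POOL_OVERHEAD) // block_size  # cf. NUMBLOCKS
    return POOL_SIZE / nblocks + POINTER

print(round(bytes_per_integer(32), 2))  # 40.51, vs. measured 40.50...40.52 for 1000+i
print(round(bytes_per_integer(40), 2))  # 48.55, vs. measured 48.56 for (1<<30)+i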