Search code examples
pythonmemory-managementprogramming-pearls

Python Data Structure memory footprint behaving weird


I was trying out the one of the programming pearls:

Given a file containing at most ten million 7-digit integers with no duplicates. What is an efficient way to print these numbers in ascending order using just 1.5Mb RAM and reading the data just once? What are the consequences of only having 1Mb of RAM and no other storage? How would your answer change if duplicates were permitted?

In order to create a test case I, generated 8999999 numbers and wrote them to a file. Then for each line, i started inserting the same to a tree, finally creating a trie structure.

Sample Code:

from sys import getsizeof

tree = dict()
xtree = dict()
f = open("data2.txt", "r")
cnt = 0
for number in f:
    cnt += 1
    currTree = tree
    xtree[number] = dict()
    for n in number.strip():
        if n not in currTree:
            currTree[n] = dict()
        currTree = currTree[n]
f.close()

print(cnt)
print(getsizeof(tree))
print(getsizeof(xtree))
print(tree)

sample file data2.txt has 20 records

The tree generated is

Generated Tree

Now the question is that when i do a memory sizing of the tree that is built, at 20 lines it shows a memory foot print of 240 bytes

At 100 line, size of tree becomes 368 bytes

and at 8999999 lines also it gives 368 bytes

I built an auxiliary map named xtree which just feeds in the data

The sizes for xtree and tree are in bytes.

Data analysis

can anyone please explain how this is so..??


Solution

  • Your tree is just a dict with up to 10 key-value pairs. In bigger tree, there aren't any more key-value pairs. There are more values inside the values inside the … inside the key-value pairs, but there's still only 10 key-value pairs in the dict. And a dict with around 10 key-value pairs taking 368 bytes seems like about what you should expect.1

    As the docs for getsizeof say:

    Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.

    See recursive sizeof recipe for an example of using getsizeof() recursively to find the size of containers and all their contents.

    Since you don't actually have a completely arbitrary data structure, but just a dict of dicts of etc. And, while you do have some shared references (e.g., if you read the number 1234567 while you already have an int with the same value in memory, Python will just reuse the same object), if you're trying to verify that you can fit into 1.5MB, you really want a worst-case measurement, so you probably want to skip the check for already-seen values.

    So, you can write something simpler instead of using that recipe if you want. But the idea will be the same:

    def total_dict_size(d):
        size = sys.getsizeof(d)
        if isinstance(d, dict):
            for key, value in d.items():
                size += sys.getsizeof(key) + total_dict_size(value)
        return size
    

    Your xtree, on the other hand, is a dict with 8999999 key-value pairs. Doing the same back-of-the-envelope calculation, I'd expect that to be a bit under 300MB. Instead, it's a bit over 300MB. Close enough.

    And you're also storing the 8999999 7-digit integers on the heap. To take some nice round numbers, let's say there are 5M distinct integers that don't fall into the handful of small values pre-created and cached by CPython. Each of those integers is small enough to fit into one 30-bit digit, so they take 28 bytes apiece on 64-bit CPython. So, that's another 140MB not accounted for in sys.getsizeof(xtree) (but they are accounted for—in fact, over-accounted, with the worst-case-measuring implementation given) if you call the recursive function above on either tree or xtree.

    So, your total memory use between tree, xtree, and the actual integers is probably somewhere on the order of 750MB, which doesn't quite fit the < 1.5MB requirement.


    1. Every Python object has some fixed header overhead, for things like the refcount, the pointer to the type, etc., plus type-specific things, like the length for most container types. Call that 64 bytes. A dict then has a hash table. It needs to be a bit bigger than 10 slots, to keep the load well below 1.0; call it 13 slots. Each slot needs a hash value, a reference to the key, and a reference to the value, so that's 3 pointers, or 24 bytes. 64 + 13 * 24 = 376. So that back-of-the-envelope calculation is off by only 8 bytes…