Can somebody explain this non-monotonic memory usage of a dictionary in CPython 2.7?
>>> import sys
>>> sys.getsizeof({})
280
>>> sys.getsizeof({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5})
280
>>> sys.getsizeof({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6})
1048
>>> sys.getsizeof({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7})
1048
>>> sys.getsizeof({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'e
ight': 8})
664
>>> sys.getsizeof({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'e
ight': 8, 'nine': 9})
664
Python3 is reasonable here, it prints the size of {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7}
as 480.
I tried this on Ubuntu 15.10 and OS X 10.11.
TLDR: The 6- and 7-entry dict literals presize the hash table badly and then quadruple the size on resize.
When CPython 2.7 evaluates a dict literal, before it starts filling in entries, the opcode it uses to create the dict is BUILD_MAP
. This takes one argument, a hint for how many entries the dict will contain, which it uses to presize the dict:
TARGET(BUILD_MAP)
{
x = _PyDict_NewPresized((Py_ssize_t)oparg);
PUSH(x);
if (x != NULL) DISPATCH();
break;
}
This is intended to minimize the number of times the dict is resized during creation, but since they didn't account for the load factor, it doesn't quite eliminate resizes.
As the source code comments indicate, _PyDict_NewPresized
is intended to "Create a new dictionary pre-sized to hold an estimated number of elements". The exact size of the hash table in the created dict is influenced by a number of implementation details, such as the minimum size (#define PyDict_MINSIZE 8
) and the requirement that the size be a power of 2 (to avoid needing division in the implementation).
For dict literals up to 7 entries, _PyDict_NewPresized
initializes an 8-entry hash table; for 8 entries, it initializes a 16-entry hash table, since the resize routine it uses always picks a capacity bigger than the argument.
Dicts resize on insertion when they become at least 2/3 full. For the 6- and 7-entry dict literals, the dict starts off with 8 entries, so a resize occurs on the 6th insertion. The dict is small enough that the resize quadruples the size of the hash table:
return dictresize(mp, (mp->ma_used > 50000 ? 2 : 4) * mp->ma_used);
mp->ma_used
is the number of used entries in the hash table, 6 at this point. 6 is less than 50000, so we call dictresize(mp, 4 * 6)
, which resizes the hash table to 32 entries, the smallest power of 2 greater than 24.
In contrast, for the 8-entry dict literal, the hash table started off with 16 entries. The dict doesn't become 2/3 full during creation, so the initial 16-entry hash table survives the dict creation, and the resulting dict is smaller than with the 6- and 7-entry dict literals.
Python 3 uses a different growth policy, among other dict implementation changes, which is why you saw different results in Python 3.