I am trying to find out what the internal load factor is for Python sets. A dictionary uses a hash table with a load factor of 2/3 (about 0.66): the number of buckets starts at 8, and when the 6th key is inserted the number of buckets grows to 16. The table below shows, for each bucket count, the maximum number of entries before a resize.
buckets | max entries before resize |
---|---|
8 | 5 |
16 | 10 |
32 | 21 |
64 | 42 |
128 | 85 |
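The numbers in the second column follow directly from the 2/3 rule: a table of `size` slots holds at most `size * 2 // 3` entries. A quick sketch to verify:

```python
# For each bucket count, the maximum number of entries a dict can hold
# before the next insert triggers a resize, under the 2/3 load factor.
for size in (8, 16, 32, 64, 128):
    print(size, size * 2 // 3)  # 8 5, 16 10, 32 21, 64 42, 128 85
```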
This can be seen with the following Python code, where the sizes of a dictionary and a set are shown with `sys.getsizeof`:
```python
import sys

d = {}
s = set()
for x in range(25):
    d[x] = 1
    s.add(x)
    print(len(d), sys.getsizeof(d), sys.getsizeof(s))
```
# of elements | dict size (bytes) | set size (bytes) |
---|---|---|
1 | 232 | 216 |
2 | 232 | 216 |
3 | 232 | 216 |
4 | 232 | 216 |
5 | 232 | 728 |
6 | 360 | 728 |
7 | 360 | 728 |
8 | 360 | 728 |
9 | 360 | 728 |
10 | 360 | 728 |
11 | 640 | 728 |
12 | 640 | 728 |
13 | 640 | 728 |
14 | 640 | 728 |
15 | 640 | 728 |
16 | 640 | 728 |
17 | 640 | 728 |
18 | 640 | 728 |
19 | 640 | 2264 |
20 | 640 | 2264 |
21 | 640 | 2264 |
22 | 1176 | 2264 |
23 | 1176 | 2264 |
24 | 1176 | 2264 |
25 | 1176 | 2264 |
The table above shows that the resize thresholds are correct for the dictionary, but not for the set: the set's memory grows at different points.
I am trying to find out what the load factor is for a set. Is that also 2/3? Or am I doing something wrong with the code?
Currently, it's about 3/5. See the source (in CPython's `Objects/setobject.c`):
```c
if ((size_t)so->fill*5 < mask*3)
    return 0;
return set_table_resize(so, so->used>50000 ? so->used*2 : so->used*4);
```
`fill` is the number of occupied table cells (including "deleted entry" markers), and `mask` is 1 less than the total table capacity.
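This explains the jumps in your table. As a sketch, the hypothetical helper below simulates that C condition in pure Python (assuming no deletions, so `fill == used`, and the usual minimum table size of 8): starting from 8 slots, the 5th insert satisfies `5*5 >= 7*3` and grows the table to 32 slots, and the 19th insert satisfies `19*5 >= 31*3` and grows it to 128 slots, matching the size jumps you observed at 5 and 19 elements.

```python
def resize_points(n_inserts):
    """Simulate the 3/5 resize rule from setobject.c and return the
    element counts at which a resize is triggered (no deletions assumed)."""
    slots = 8                                 # minimum table size
    points = []
    for used in range(1, n_inserts + 1):
        mask = slots - 1
        # the condition from set_add_entry, checked after each insert
        if used * 5 >= mask * 3:
            points.append(used)
            target = used * 2 if used > 50000 else used * 4
            slots = 8
            while slots <= target:            # smallest power of 2 > target
                slots <<= 1
    return points

print(resize_points(25))  # [5, 19]
```

Note that the growth factor (`used*4` for small sets) overshoots deliberately, which is why the set's memory jumps are larger and less frequent than the dict's.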