Tags: python, set, cpython, pypy, python-internals

How do CPython and PyPy decide when to resize a set?


When adding elements to sets on CPython and PyPy, when are they resized, and what will be the sizes of the underlying container?

This question is similar in principle to max_load_factor, as C++ describes it for its unordered_map.


Solution

  • CPython uses this check to decide when to resize:

    if (!(so->used > n_used && so->fill*3 >= (so->mask+1)*2))
    

    This basically means that the container resizes once it is 2/3 full (fill counts live plus deleted entries).

    The resize itself sizes the new table from the number of used entries rather than the old capacity: four times the used count for small sets, twice for large ones (over 50,000 entries), rounded up to the next power of two:

    return set_table_resize(so, so->used>50000 ? so->used*2 : so->used*4);
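
    To make the arithmetic concrete, here is a small Python model of that policy (a sketch, not the real C code), assuming a fresh set with no deletions, so fill equals used, and the initial table size of 8 (PySet_MINSIZE):

    def predicted_resize_points(n):
        """0-based indices at which adding to a fresh set should trigger a resize."""
        size = 8                        # PySet_MINSIZE
        used = 0
        points = []
        for i in range(n):
            used += 1                   # fill == used with no deletions
            if used * 3 >= size * 2:    # the fill*3 >= (mask+1)*2 check
                target = used * 2 if used > 50000 else used * 4
                size = 8
                while size <= target:   # smallest power of two > target
                    size <<= 1
                points.append(i)
        return points

    print(predicted_resize_points(100000))
    # [5, 21, 85, 341, 1365, 5461, 21845, 87381]

    Those predictions line up with the benchmark results further down.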
    

    Armin Rigo points out in the comments that PyPy implements its sets with dictionaries, so the relevant resizing code is:

    jit.conditional_call(d.resize_counter <= x * 3,
                         _ll_dict_resize_to, d, num_extra)
    

    This is the same strategy: resize_counter tracks the empty space left in the dictionary, so it crosses the threshold at the same 2/3 load factor.
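
    To see that the thresholds line up, here is a rough model of that check. It assumes resize_counter is maintained as 2*size - 3*used (as PyPy's rordereddict.py appears to do) and takes num_extra to be 1 for a single insertion:

    def pypy_check_fires(size, used, num_extra=1):
        """Model of the `d.resize_counter <= x * 3` condition."""
        resize_counter = 2 * size - 3 * used    # assumed bookkeeping of free space
        return resize_counter <= 3 * num_extra

    for size in (8, 32, 128, 512):
        used = next(u for u in range(size) if pypy_check_fires(size, u))
        print(size, used, round(used / size, 3))   # the ratio tends to 2/3

    Under those assumptions, the first counts that fire (5, 21, 85, 341) are exactly the indices the benchmark below reports.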


    Before this was pointed out, I resorted to benchmarking. You can detect resizes by looking for very small pauses. This is somewhat random for small sizes, so you have to be careful to remove the noise. Here's a script that does that:

    from collections import Counter
    import time
    
    iterations = 100
    internal_iterations = 100000
    
    def main():
        resizes = Counter()
    
        for i in range(iterations):
            print("Completion: [{:3.0%}]\r".format(i/iterations), end="")
    
            s = set()
            maxtime_so_far = 0
            for j in range(internal_iterations):
                start = time.time()
                s.add(j)
                end = time.time()
    
                # A resize should be the slowest add so far, so record
                # the index whenever a new maximum is seen.
                if end-start > maxtime_so_far:
                    maxtime_so_far = end-start
                    resizes[j] += 1
    
        print()
    
        format_string = "{0:<6} = 0b{0:<18b} [found %: {1:2.0%}]"
    
        # Only report indices that set a new maximum in at least 5% of
        # runs, to filter out one-off noise.
        for index in sorted(resizes):
            chance = resizes[index] / iterations
    
            if chance >= 0.05:
                print(format_string.format(index, chance))
    
    main()
    

    and here's the output for CPython:

    Completion: [99%]
    0      = 0b0                  [found %: 100%]
    5      = 0b101                [found %: 83%]
    21     = 0b10101              [found %: 12%]
    85     = 0b1010101            [found %: 94%]
    341    = 0b101010101          [found %: 97%]
    1365   = 0b10101010101        [found %: 100%]
    5461   = 0b1010101010101      [found %: 100%]
    21845  = 0b101010101010101    [found %: 100%]
    87381  = 0b10101010101010101  [found %: 100%]
    

    You can see the 0b10101… pattern, which is what a power of two divided by three looks like; that is where the 2/3 threshold falls. (Strictly, the resize is triggered by the next element added, but the script's 0-based indexing cancels the off-by-one.)
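
    As a quick check, alternating-bit numbers are exactly what dividing a power of two by three produces:

    for k in (4, 6, 8, 10, 12, 14, 16, 18):
        n = 2**k // 3                       # twice the table size, floor-divided by 3
        print("{0:<6} = 0b{0:b}".format(n))
    # prints 5, 21, 85, 341, 1365, 5461, 21845, 87381, i.e. the indices above

    (Index 0 is likely not a resize at all: the first timed add is almost always the maximum so far, so the script records it regardless.)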

    On PyPy3, after increasing the iteration counts (iterations = 1000; internal_iterations = 100000), I get

    Completion: [100%]
    0      = 0b0                  [found %: 78%]
    5      = 0b101                [found %: 6%]
    21     = 0b10101              [found %: 5%]
    341    = 0b101010101          [found %: 24%]
    1365   = 0b10101010101        [found %: 66%]
    5461   = 0b1010101010101      [found %: 100%]
    21845  = 0b101010101010101    [found %: 100%]
    87381  = 0b10101010101010101  [found %: 71%]
    

    This implies the strategy is the same for PyPy.

    Strangely, and possibly due to the JIT or GC, sometimes I get something more like:

    Completion: [100%]
    0      = 0b0                  [found %: 13%]
    5      = 0b101                [found %: 11%]
    21     = 0b10101              [found %: 22%]
    22     = 0b10110              [found %: 6%]
    23     = 0b10111              [found %: 5%]
    24     = 0b11000              [found %: 5%]
    341    = 0b101010101          [found %: 30%]
    1365   = 0b10101010101        [found %: 66%]
    5461   = 0b1010101010101      [found %: 98%]
    

    depending on the iteration counts. I imagine this is because the timings around that point are so small that noise dominates, and it probably doesn't mean very much. If a GC collection happens around item 20, it's liable to produce exactly that kind of scatter.