Search code examples
pythonsetcpythonpypytimeit

Why does set subtraction and .difference() run at different speeds


To find the difference between two sets there is the - operator and .difference(). I'm using this code to time each of those:

import timeit

print(timeit.timeit('''
a.difference({b})
                    ''', setup='''
a = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
b = 3
                    '''))
# => 0.2423834060318768

print(timeit.timeit('''
a - {b}
                    ''', setup='''
a = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
b = 3
                    '''))
# => 0.2027170000365004

When I run this in CPython I get this:

0.24530324200168252
0.205820870003663

This made sense to me because .difference() can take any iterable, not just sets. However, when I run it in PyPy I get this:

0.14613953093066812
0.23659668595064431

The times are completely flipped, so surely it can't be because .difference() can take any interable. What is the difference between the implementations of .difference() and -? Is there any difference between CPython and PyPy's implementations?


Solution

  • In PyPy, there is an optimization that is invoked by difference() but not by __sub__(): when you use a.difference(b) and b is a smaller set than a, then it copies a completely and removes the items from b. If it's not the case, it starts from an empty set and adds the items of a that are not in b. For some reason __sub__() doesn't go through the path that selects between these two implementations, and always picks the second logic.

    Please report it on https://foss.heptapod.net/pypy/pypy/-/issues and it will likely be fixed.