Search code examples
pythonperformancesetpython-internals

Why is A.issuperset(B) much slower than all(b in A for b in B)?


Consider testing whether a set A is a superset of an iterable B, once with the set's method for exactly that, and once with my own expression using the definition of superset:

>>> A = set(range(1000))
>>> B = range(-1000, 0)
>>> A.issuperset(B)
False
>>> all(b in A for b in B)
False

Now let's time that:

>>> from timeit import timeit
>>> timeit(lambda: A.issuperset(B))
52.666367300000005
>>> timeit(lambda: all(b in A for b in B))
0.9698789999999917

The set's own method is much slower. Why? Presumably it can/should do the same thing, but at C speed, so should be faster.

I'm using CPython 3.8.1.


Solution

  • The implementation of set.issubset and set.issuperset insists on building a set out of the argument first:

    static PyObject *
    set_issubset(PySetObject *so, PyObject *other)
    {
        setentry *entry;
        Py_ssize_t pos = 0;
        int rv;
    
        if (!PyAnySet_Check(other)) {
            PyObject *tmp, *result;
            tmp = make_new_set(&PySet_Type, other);
            if (tmp == NULL)
                return NULL;
            result = set_issubset(so, tmp);
            Py_DECREF(tmp);
            return result;
        }
        if (PySet_GET_SIZE(so) > PySet_GET_SIZE(other))
            Py_RETURN_FALSE;
    
        while (set_next(so, &pos, &entry)) {
            rv = set_contains_entry((PySetObject *)other, entry->key, entry->hash);
            if (rv < 0)
                return NULL;
            if (!rv)
                Py_RETURN_FALSE;
        }
        Py_RETURN_TRUE;
    }
    
    PyDoc_STRVAR(issubset_doc, "Report whether another set contains this set.");
    
    static PyObject *
    set_issuperset(PySetObject *so, PyObject *other)
    {
        PyObject *tmp, *result;
    
        if (!PyAnySet_Check(other)) {
            tmp = make_new_set(&PySet_Type, other);
            if (tmp == NULL)
                return NULL;
            result = set_issuperset(so, tmp);
            Py_DECREF(tmp);
            return result;
        }
        return set_issubset((PySetObject *)other, (PyObject *)so);
    }
    
    PyDoc_STRVAR(issuperset_doc, "Report whether this set contains another set.");
    

    issuperset makes a set out of the argument, then calls other.issubset(self). (issubset also insists on having a set as its argument, but it gets one, so it doesn't need to convert in this case.) They could have fairly easily added a code path to issuperset to handle non-set arguments without a set conversion, but they didn't.

    I suspect the reason for this may be to throw an error on calls like {1}.issuperset([2, [3]]), where the argument contains unhashable elements. However, it's also likely that no one bothered to optimize it. Searching the Python issue tracker for issuperset turns up 0 issues about optimizing issuperset, not even closed issues. There's a closed issue about a more difficult optimization for issubset, surprisingly, but while it would have caused similar exception behavior changes, none of the replies on the issue said anything about that.