Search code examples
pythonpypypython-internals

Implementation specific behavior of `groupby` and argument unpacking


I'm trying to understand a quirk that I discovered in an answer that I wrote up earlier today. Basically, I was yielding groups from a generator function that wrapped itertools.groupby. The interesting thing that I discovered is that if I unpack the generator on the left hand side of an assignment, the last element from the generator remains. For example:

# test_gb.py
from itertools import groupby
from operator import itemgetter

inputs = ((x > 5, x) for x in range(10))

def make_groups(inputs):
    for _, group in groupby(inputs, key=itemgetter(0)):
        yield group

a, b = make_groups(inputs)
print(list(a))
print(list(b))

On Cpython, this results in:

$ python3 ~/sandbox/test_gb.py 
[]
[(True, 9)]

this is the case on both CPython2.7 and CPython3.5.

On PyPy, it results in:

$ pypy ~/sandbox/test_gb.py 
[]
[]

In both cases, the first empty list ("a") is pretty easy to explain -- The group from itertools gets consumed as soon as the next element is required. Since we didn't save those values anywhere, they're lost to the ether.

It seems to me that the PyPy version makes sense for the second empty list ("b") too ... When unpacking, we consume b too (because python needs to look for what's after to make sure that it shouldn't throw a ValueError for the wrong number of items to unpack). For some reason though, the CPython version retains the last element from the input iterable ... Can anyone explain why this might be?

Edit

This is probably more or less obvious, but we can also write it as:

inputs = ((x > 5, x) for x in range(10))
(_, a), (_, b) = groupby(inputs, key=itemgetter(0))
print(list(a))
print(list(b))

and get the same results ...


Solution

  • It's because the groupby object handles the bookkeeping and the grouper objects just reference their key and the parent groupby object:

    typedef struct {
        PyObject_HEAD
        PyObject *it;          /* iterator over the input sequence */
        PyObject *keyfunc;     /* the second argument for the groupby function */
        PyObject *tgtkey;      /* the key for the current "grouper" */
        PyObject *currkey;     /* the key for the current "item" of the iterator*/
        PyObject *currvalue;   /* the plain value of the current "item" */
    } groupbyobject;
    
    typedef struct {
        PyObject_HEAD
        PyObject *parent;      /* the groupby object */
        PyObject *tgtkey;      /* the key value for this grouper object. */
    } _grouperobject;
    

    Since you're not iterating the grouper object when you unpack the groupby object I'll ignore them for now. So what's interesting is what happens in the groupby when you call next on it:

    static PyObject *
    groupby_next(groupbyobject *gbo)
    {
        PyObject *newvalue, *newkey, *r, *grouper;
    
        /* skip to next iteration group */
        for (;;) {
            if (gbo->currkey == NULL)
                /* pass */;
            else if (gbo->tgtkey == NULL)
                break;
            else {
                int rcmp;
    
                rcmp = PyObject_RichCompareBool(gbo->tgtkey, gbo->currkey, Py_EQ);
                if (rcmp == 0)
                    break;
            }
    
            newvalue = PyIter_Next(gbo->it);
            if (newvalue == NULL)
                return NULL;   /* just return NULL, no invalidation of attributes */
            newkey = PyObject_CallFunctionObjArgs(gbo->keyfunc, newvalue, NULL);
    
            gbo->currkey = newkey;
            gbo->currvalue = newvalue;
        }
        gbo->tgtkey = gbo->currkey;
    
        grouper = _grouper_create(gbo, gbo->tgtkey);
        r = PyTuple_Pack(2, gbo->currkey, grouper);
        return r;
    }
    

    I removed all the irrelevant exception handling code and removed or simplified pure reference counting stuff. The interesting thing here is that when you reach the end of the iterator the gbo->currkey, gbo->currvalue and gbo->tgtkey aren't set to NULL, they will still point to the last encountered values (the last item of the iterator) because it just return NULL when PyIter_Next(gbo->it) == NULL.

    After this finished you have your two grouper objects. The first one will have a tgtvalue of False and the second with True. Let's have a look what happens when you call next on these groupers:

    static PyObject *
    _grouper_next(_grouperobject *igo)
    {
        groupbyobject *gbo = (groupbyobject *)igo->parent;
        PyObject *newvalue, *newkey, *r;
        int rcmp;
    
        if (gbo->currvalue == NULL) {
            /* removed because irrelevant. */
        }
    
        rcmp = PyObject_RichCompareBool(igo->tgtkey, gbo->currkey, Py_EQ);
        if (rcmp <= 0)
            /* got any error or current group is end */
            return NULL;
    
        r = gbo->currvalue;  /* this accesses the last value of the groupby object */
        gbo->currvalue = NULL;
        gbo->currkey = NULL;
    
        return r;
    }
    

    So remember currvalue is not NULL, so the first if branch isn't interesting. For your first grouper it compares the tgtkey of the grouper and the groupby object and sees that they differ and it will immediatly return NULL. So you got an empty list.

    For the second iterator the tgtkeys are identical, so it will return the currvalue of the groupby object (which is the last encountered value in the iterator!), but this time it will set the currvalue and currkey of the groupby object to NULL.


    Switching back to python: The really interesting quirks happen if you have a grouper with the same tgtkey as the last group in your groupby:

    import itertools
    
    >>> inputs = [(x > 5, x) for x in range(10)] + [(False, 10)]
    >>> (_, g1), (_, g2), (_, g3) = itertools.groupby(inputs, key=lambda x: x[0])
    >>> list(g1)
    [(False, 10)]
    >>> list(g3)
    []
    

    That one element in g1 didn't belong to the first group at all - but because the tgtkey of the first grouper object is False and the last tgtkey is False the first grouper thought it belongs into the first group. It also invalidated the groupby object so the third group is now empty.


    All the code was taken from the Python source code but shortened.