I'm trying to understand a quirk that I discovered in an answer that I wrote up earlier today. Basically, I was yielding groups from a generator function that wrapped itertools.groupby
. The interesting thing that I discovered is that if I unpack the generator on the left hand side of an assignment, the last element from the generator remains. For example:
# test_gb.py
from itertools import groupby
from operator import itemgetter
inputs = ((x > 5, x) for x in range(10))
def make_groups(inputs):
for _, group in groupby(inputs, key=itemgetter(0)):
yield group
a, b = make_groups(inputs)
print(list(a))
print(list(b))
On Cpython, this results in:
$ python3 ~/sandbox/test_gb.py
[]
[(True, 9)]
this is the case on both CPython2.7 and CPython3.5.
On PyPy, it results in:
$ pypy ~/sandbox/test_gb.py
[]
[]
In both cases, the first empty list ("a
") is pretty easy to explain -- The group from itertools
gets consumed as soon as the next element is required. Since we didn't save those values anywhere, they're lost to the ether.
It seems to me that the PyPy
version makes sense for the second empty list ("b
") too ... When unpacking, we consume b
too (because python needs to look for what's after to make sure that it shouldn't throw a ValueError
for the wrong number of items to unpack). For some reason though, the CPython
version retains the last element from the input iterable ... Can anyone explain why this might be?
Edit
This is probably more or less obvious, but we can also write it as:
inputs = ((x > 5, x) for x in range(10))
(_, a), (_, b) = groupby(inputs, key=itemgetter(0))
print(list(a))
print(list(b))
and get the same results ...
It's because the groupby
object handles the bookkeeping and the grouper
objects just reference their key
and the parent groupby
object:
typedef struct {
PyObject_HEAD
PyObject *it; /* iterator over the input sequence */
PyObject *keyfunc; /* the second argument for the groupby function */
PyObject *tgtkey; /* the key for the current "grouper" */
PyObject *currkey; /* the key for the current "item" of the iterator*/
PyObject *currvalue; /* the plain value of the current "item" */
} groupbyobject;
typedef struct {
PyObject_HEAD
PyObject *parent; /* the groupby object */
PyObject *tgtkey; /* the key value for this grouper object. */
} _grouperobject;
Since you're not iterating the grouper
object when you unpack the groupby
object I'll ignore them for now. So what's interesting is what happens in the groupby
when you call next
on it:
static PyObject *
groupby_next(groupbyobject *gbo)
{
PyObject *newvalue, *newkey, *r, *grouper;
/* skip to next iteration group */
for (;;) {
if (gbo->currkey == NULL)
/* pass */;
else if (gbo->tgtkey == NULL)
break;
else {
int rcmp;
rcmp = PyObject_RichCompareBool(gbo->tgtkey, gbo->currkey, Py_EQ);
if (rcmp == 0)
break;
}
newvalue = PyIter_Next(gbo->it);
if (newvalue == NULL)
return NULL; /* just return NULL, no invalidation of attributes */
newkey = PyObject_CallFunctionObjArgs(gbo->keyfunc, newvalue, NULL);
gbo->currkey = newkey;
gbo->currvalue = newvalue;
}
gbo->tgtkey = gbo->currkey;
grouper = _grouper_create(gbo, gbo->tgtkey);
r = PyTuple_Pack(2, gbo->currkey, grouper);
return r;
}
I removed all the irrelevant exception handling code and removed or simplified pure reference counting stuff. The interesting thing here is that when you reach the end of the iterator the gbo->currkey
, gbo->currvalue
and gbo->tgtkey
aren't set to NULL
, they will still point to the last encountered values (the last item of the iterator) because it just return NULL
when PyIter_Next(gbo->it) == NULL
.
After this finished you have your two grouper
objects. The first one will have a tgtvalue
of False
and the second with True
. Let's have a look what happens when you call next
on these grouper
s:
static PyObject *
_grouper_next(_grouperobject *igo)
{
groupbyobject *gbo = (groupbyobject *)igo->parent;
PyObject *newvalue, *newkey, *r;
int rcmp;
if (gbo->currvalue == NULL) {
/* removed because irrelevant. */
}
rcmp = PyObject_RichCompareBool(igo->tgtkey, gbo->currkey, Py_EQ);
if (rcmp <= 0)
/* got any error or current group is end */
return NULL;
r = gbo->currvalue; /* this accesses the last value of the groupby object */
gbo->currvalue = NULL;
gbo->currkey = NULL;
return r;
}
So remember currvalue
is not NULL
, so the first if
branch isn't interesting. For your first grouper it compares the tgtkey
of the grouper
and the groupby
object and sees that they differ and it will immediatly return NULL
. So you got an empty list.
For the second iterator the tgtkey
s are identical, so it will return the currvalue
of the groupby
object (which is the last encountered value in the iterator!), but this time it will set the currvalue
and currkey
of the groupby
object to NULL
.
Switching back to python: The really interesting quirks happen if you have a grouper
with the same tgtkey
as the last group in your groupby
:
import itertools
>>> inputs = [(x > 5, x) for x in range(10)] + [(False, 10)]
>>> (_, g1), (_, g2), (_, g3) = itertools.groupby(inputs, key=lambda x: x[0])
>>> list(g1)
[(False, 10)]
>>> list(g3)
[]
That one element in g1
didn't belong to the first group at all - but because the tgtkey
of the first grouper object is False
and the last tgtkey
is False
the first grouper thought it belongs into the first group. It also invalidated the groupby
object so the third group is now empty.
All the code was taken from the Python source code but shortened.