from itertools import groupby
from operator import itemgetter
d = [{'k': 'v1'}]
r = ((k, v) for k, v in groupby(d, key=itemgetter('k')))
for k, v in r:
    print(k, list(v))  # v1 [{'k': 'v1'}]
print('---')
r = {k: v for k, v in groupby(d, key=itemgetter('k'))}
for k, v in r.items():
    print(k, list(v))  # v1 []
Seems like some quirk, or am I missing something?
This is a documented part of itertools.groupby:

"The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list."
In other words, you need to consume each group before advancing the groupby iterator to the next one; in this case, that means building the list inside the dict comprehension itself:
from itertools import groupby
from operator import itemgetter
d = [{'k': 'v1'}]
r = {k: list(v) for k, v in groupby(d, key=itemgetter('k'))}
for k, v in r.items():
    print(k, v)  # v1 [{'k': 'v1'}]
In your first example, because you are using a generator expression, you don't actually start iterating the groupby iterator until you start the for loop. However, you would have the same issue if you used a non-lazy list comprehension instead of a generator (i.e. r = [(k, v) for k, v in groupby(d, key=itemgetter('k'))]).
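To make that concrete, here is a small sketch using the same one-element d as above, showing the non-lazy list comprehension running into exactly the same problem:

```python
from itertools import groupby
from operator import itemgetter

d = [{'k': 'v1'}]
# The list comprehension fully advances the groupby object before
# any group is consumed, so the stored group iterators come back empty.
pairs = [(k, v) for k, v in groupby(d, key=itemgetter('k'))]
for k, v in pairs:
    print(k, list(v))  # v1 []
```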
Preserving lazy iteration is the motivating idea behind itertools. Because it deals with (possibly large, or infinite) iterators, it never wants to store any values in memory. It just calls next() on the underlying iterator and does something with that value. Once you've called next() you can't go back to earlier values (without storing them, which itertools doesn't want to do).
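As a quick illustration of that design (a generic sketch, not specific to groupby), itertools-style tools can work over an infinite source precisely because they only pull one value at a time:

```python
from itertools import count, islice

# count() is an infinite iterator; nothing is stored up front.
# islice lazily pulls just the five values we ask for.
first_five_squares = list(islice((n * n for n in count()), 5))
print(first_five_squares)  # [0, 1, 4, 9, 16]
```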
With groupby it's easier to see with an example. Here is a simple generator that makes alternating ranges of positive and negative numbers and a groupby iterator that groups them:
def make_groups():
    i = 1
    while True:
        for n in range(1, 10):
            print("yielding: ", n*i)
            yield n * i
        i *= -1
g = make_groups()
grouper = groupby(g, key=lambda x: x>0)
make_groups prints a line each time next() is called, before yielding the value, to help show what's happening. When we call next() on grouper, this results in a next() call to g and gets our first group and value:
> k, gr = next(grouper)
yielding: 1
Now each next() call on gr results in a next() call to the underlying g, as you can see from the print:
> next(gr)
1 # already have this value from the initial next(grouper)
> next(gr)
yielding: 2 # gets the next value and advances the underlying generator to its next yield
2
Now look what happens if we call next() on grouper to get the next group:
> next(grouper)
yielding: 3
yielding: 4
yielding: 5
yielding: 6
yielding: 7
yielding: 8
yielding: 9
yielding: -1
groupby iterated through the generator until it hit a value that changed the key, and those intermediate values have already been yielded by g. We can no longer get the next value of gr (i.e. 3) unless we had stored all those values, or somehow tee'd the underlying g generator off into two independent iterators. Neither of these is a good solution for the default implementation (especially since the point of itertools is not to store values), so it leaves that up to you: you need to store these values before something causes next(grouper) to be called and advances the generator past the values you wanted.
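Putting that into practice with the make_groups example above, one sketch of a fix is to materialize each group as a list at the moment it is produced, before grouper advances (the debug print is dropped here to keep the output readable):

```python
from itertools import groupby, islice

def make_groups():
    # Same alternating generator as above, minus the debug print
    i = 1
    while True:
        for n in range(1, 10):
            yield n * i
        i *= -1

grouper = groupby(make_groups(), key=lambda x: x > 0)
# list(g) stores each group's values before the next next(grouper) call
first_three = [(k, list(g)) for k, g in islice(grouper, 3)]
for key, values in first_three:
    print(key, values)
```

Each `list(g)` runs inside the comprehension, before islice pulls the next group, so no values are lost: the output is the True group 1..9, the False group -1..-9, then the True group 1..9 again.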