When using Apache Beam Python with GCP Dataflow, is there a downside to materializing the results of a `GroupByKey`, say, to count the number of elements? For example:
```python
def consume_group_by_key(element):
    season, fruits = element
    for fruit in fruits:
        yield f"{fruit} grows in {season}"

def consume_group_by_key_materialize(element):
    season, fruits = element
    num_fruits = len(list(fruits))
    print(f"There are {num_fruits} fruits grown in {season}")
    for fruit in fruits:
        yield f"{fruit} grows in {season}"
```
```python
(
    pipeline
    | 'Create produce counts' >> beam.Create([
        ('spring', 'strawberry'),
        ('spring', 'carrot'),
        ('spring', 'eggplant'),
        ('spring', 'tomato'),
        ('summer', 'carrot'),
        ('summer', 'tomato'),
        ('summer', 'corn'),
        ('fall', 'carrot'),
        ('fall', 'tomato'),
        ('winter', 'eggplant'),
    ])
    | 'Group counts per produce' >> beam.GroupByKey()
    | beam.ParDo(consume_group_by_key_materialize)
)
```
Are the grouped values, `fruits`, passed to my DoFn as a generator? Is there a performance penalty for using `consume_group_by_key_materialize` instead of `consume_group_by_key`, or in other words, for materializing `fruits` via something like `len(list(fruits))`? If there are billions of fruits, will this use up all my memory?
You are correct: `len(list(fruits))` will materialize the entire list in memory before taking its size, whereas your `consume_group_by_key` function iterates over the iterable once and (on distributed runners like Dataflow) does not require bringing the entire set of values into memory at once.
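
If you still want the per-group count without pulling everything into memory, one option is to tally elements during the single pass you already make. The sketch below is illustrative rather than part of your pipeline (the name `consume_group_by_key_with_count` is mine); it counts as it yields, so memory use stays constant regardless of group size:

```python
def consume_group_by_key_with_count(element):
    """Single-pass version: yields each output and counts as it goes,
    without materializing the grouped values into a list."""
    season, fruits = element
    num_fruits = 0
    for fruit in fruits:  # streams values; only the current element is held
        num_fruits += 1
        yield f"{fruit} grows in {season}"
    # Runs once the iterable is exhausted; only the integer count is retained.
    print(f"There are {num_fruits} fruits grown in {season}")
```

And if you only need the counts themselves, a combiner such as `beam.combiners.Count.PerKey()` applied to the un-grouped PCollection avoids iterating the `GroupByKey` result in your own code at all.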