When using Apache Beam Python with GCP Dataflow, is there a downside to materializing the results of a `GroupByKey`, say, to count the number of elements? For example:
```python
def consume_group_by_key(element):
    season, fruits = element
    for fruit in fruits:
        yield f"{fruit} grows in {season}"

def consume_group_by_key_materialize(element):
    season, fruits = element
    num_fruits = len(list(fruits))
    print(f"There are {num_fruits} fruits grown in {season}")
    for fruit in fruits:
        yield f"{fruit} grows in {season}"
```
```python
(
    pipeline
    | 'Create produce counts' >> beam.Create([
        ('spring', 'strawberry'),
        ('spring', 'carrot'),
        ('spring', 'eggplant'),
        ('spring', 'tomato'),
        ('summer', 'carrot'),
        ('summer', 'tomato'),
        ('summer', 'corn'),
        ('fall', 'carrot'),
        ('fall', 'tomato'),
        ('winter', 'eggplant'),
    ])
    | 'Group counts per produce' >> beam.GroupByKey()
    | beam.ParDo(consume_group_by_key_materialize)
)
```
Are the grouped values, `fruits`, passed to my DoFn as a generator? Is there a performance penalty for using `consume_group_by_key_materialize` instead of `consume_group_by_key`, or in other words, for materializing `fruits` via something like `len(list(fruits))`? If there are billions of fruits, will this use up all my memory?
You are correct: `len(list(fruits))` will materialize the entire list in memory before taking its size, whereas your `consume_group_by_key` function iterates over the iterable once and (on distributed runners like Dataflow) does not require bringing the entire set of values into memory at once.
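
If you still want the per-group count without pulling everything into memory, one option is to tally elements during the single pass you already make. The sketch below is illustrative rather than part of your pipeline (the name `consume_group_by_key_with_count` is mine); it counts as it yields, so memory use stays constant regardless of group size:

```python
def consume_group_by_key_with_count(element):
    """Single-pass version: yields each output and counts as it goes,
    without materializing the grouped values into a list."""
    season, fruits = element
    num_fruits = 0
    for fruit in fruits:  # streams values; only the current element is held
        num_fruits += 1
        yield f"{fruit} grows in {season}"
    # Runs once the iterable is exhausted; only the integer count is retained.
    print(f"There are {num_fruits} fruits grown in {season}")
```

And if you only need the counts themselves, a combiner such as `beam.combiners.Count.PerKey()` applied to the un-grouped PCollection avoids iterating the `GroupByKey` result in your own code at all.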