From:
https://github.com/spotify/scio/wiki/Scio-data-guideline
"Prefer combine/aggregate/reduce transforms over groupByKey. Keep in mind that a reduce operation must be associative and commutative."
Why in particular would one prefer an aggregate over a groupByKey?
Combine, aggregation, and reduce transforms are preferred over groupByKey because the former are more memory efficient during pipeline execution. This is due to the implementation of the primitive GroupByKey
and Combine
transforms in Apache Beam. The answer to this question isn't necessarily specific to Scio.
GroupByKey
requires that all key-value pairs remain in memory, which could result in OutOfMemoryError
s. All key-value pairs remain in memory per window. groupByKey
uses Beam's primitive GroupByKey
transform.
Aggregations remove the need to hold all values in memory because values are continually combined/reduced during the execution of the transform. Values are combined/reduced in a non-deterministic order, which is why all combine/reduce operations must be associative. Scio's implementation of aggregateByKey
uses Beam's primitive Combine
transform.
References:
1. Scio groupByKey
2. Scio aggregateByKey
3. Apache Beam GroupByKey
4. Apache Beam Combine
5. Google Cloud Dataflow Combine