Search code examples
scalaapache-beamdataflowspotify-scio

Why in Scio do you prefer aggregate over groupByKey?


From:

https://github.com/spotify/scio/wiki/Scio-data-guideline

"Prefer combine/aggregate/reduce transforms over groupByKey. Keep in mind that a reduce operation must be associative and commutative."

Why in particular would one prefer an aggregate over a groupByKey?


Solution

  • Combine, aggregation, and reduce transforms are preferred over groupByKey because the former are more memory efficient during pipeline execution. This is due to the implementation of the primitive GroupByKey and Combine transforms in Apache Beam. The answer to this question isn't necessarily specific to Scio.

    GroupByKey requires that all key-value pairs remain in memory, which could result in OutOfMemoryErrors. All key-value pairs remain in memory per window. groupByKey uses Beam's primitive GroupByKey transform.

    Aggregations remove the need to hold all values in memory because values are continually combined/reduced during the execution of the transform. Values are combined/reduced in a non-deterministic order, which is why all combine/reduce operations must be associative. Scio's implementation of aggregateByKey uses Beam's primitive Combine transform.

    References:
    1. Scio groupByKey
    2. Scio aggregateByKey
    3. Apache Beam GroupByKey
    4. Apache Beam Combine
    5. Google Cloud Dataflow Combine