Search code examples
apache-flink

Clarification about GroupCombine with respect to partial results


Flink's documentation for GroupCombine states:

Note: The GroupCombine on a Grouped DataSet is performed in memory with a greedy strategy which may not process all data at once but in multiple steps. It is also performed on the individual partitions without a data exchange like in a GroupReduce transformation. This may lead to partial results.

With the following remark for full (non-grouped) DataSets:

The GroupCombine on a full DataSet works similar to the GroupCombine on a grouped DataSet. The data is partitioned on all nodes and then combined in a greedy fashion (i.e. only data fitting into memory is combined at once).

Does this mean that if my dataset consists of, for example:

1
2
3

and I want to generate all pairwise combinations

(1, 2), (1, 3), (2, 3)

I cannot implement this in a general way with a GroupCombine transformation because it doesn't guarantee that the whole group will fit in a given partition's memory?


Solution

  • GroupCombine is a non-deterministic operation in Flink. It is typically used to perform partial compuations (like aggregations) and is followed by a deterministic operation like GroupReduce that consumes the partial results. GroupCombine is typically used to reduce the cost of the deterministic operation by performing less-expensive local, in-memory computations.

    If you need compute deterministic results on groups of records, you should use GroupReduce.