apache-flink flink-streaming system-design

Flink operator design advice

I have some data that I need to process and aggregate, ideally avoiding data skew but I am unsure how to design a topology that would do so.

As an example, my data would look like this:

struct Object {
    int a
    int b
    int c
}

Now, the goal is to aggregate data via combinations of int a and b, and determine how many times int c has occured. So as example, say we have data:

Object(1,2,3)
Object(1,2,1)
Object(1,4,1)
Object(1,4,1)

The final goal is to look like

Result(1,2,{3:1, 1:1}) // Key 3 and 1 both occured only once
Result(1,4,{1:2})     // Key 1 has occured twice in the datastream

My current approach, is to KeyBy a Tuple2 of int a and b, then use a session window to aggregate the occurences of int c and return output. I am using a session window because the same int a and b combinations happen during similar time frames, one can understand this as an user's action.

However, this presents a dataskew problem as I am struggling to distribute workload evenly. Is there a better design to approach this problem? Thanks.

Solution

If you know for sure that data skew is a problem, then you could first do a count of occurrences using all three values as the key, and then do a second aggregation of those results with a keyBy of just the first two values.

Though this would only help if you know that you're getting a reasonable number of occurrences for each unique key from all three values.