join stream apache-flink distributed-system

Apache Flink connect versus join

In Apache Flink stream processing, how does the join operation differ from connect, and consequently how do CoProcessFunction and ProcessJoinFunction differ? Is it the onTimer function provided by CoProcessFunction? Can you please provide an example use case that suits join but not connect, and one that suits connect but not join?

Solution

The difference is quite significant. A join works more or less like an SQL inner join: you need to provide the fields that will be used for joining, and the join is evaluated over a window.

So, basically, you define the key for each stream that will be used to join, and the window over which the results are evaluated. The ProcessJoinFunction lets you process the elements after they have been joined, but you don't have control over the join mechanism itself, i.e. each pair of already joined elements is passed to the ProcessJoinFunction.
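To make the join semantics above concrete, here is a minimal sketch in plain Java. It is a toy model, not Flink API: the `Event` type, the `join` helper, and the tumbling-window rule (integer division of non-negative timestamps) are all assumptions made for illustration. The point it shows is that the framework does the matching and the user function only ever receives already matched pairs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Plain-Java model of a tumbling-window inner join; this is not Flink API.
class WindowJoinSketch {

    // Hypothetical event type: a join key, an event timestamp in ms, and a payload.
    static final class Event {
        final String key;
        final long timestamp;
        final String value;

        Event(String key, long timestamp, String value) {
            this.key = key;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Pairs every left/right element that shares a key and a tumbling window,
    // then hands each pair to the user function. The user function only sees
    // matched pairs; it has no say in how the matching is done.
    // Assumes non-negative timestamps and windowSizeMs > 0.
    static <R> List<R> join(List<Event> left, List<Event> right, long windowSizeMs,
                            BiFunction<Event, Event, R> joinFunction) {
        List<R> out = new ArrayList<>();
        for (Event l : left) {
            for (Event r : right) {
                boolean sameKey = l.key.equals(r.key);
                boolean sameWindow = l.timestamp / windowSizeMs == r.timestamp / windowSizeMs;
                if (sameKey && sameWindow) {
                    out.add(joinFunction.apply(l, r));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> orders = List.of(
                new Event("a", 100, "order-1"), new Event("b", 1200, "order-2"));
        List<Event> payments = List.of(
                new Event("a", 900, "pay-1"), new Event("b", 2500, "pay-2"));
        // Only the "a" elements share both a key and the window [0, 1000).
        System.out.println(join(orders, payments, 1000, (l, r) -> l.value + "+" + r.value));
    }
}
```

In this model the "b" elements are never paired because they fall into different windows, and the user function is never told about them. That is the inner-join behaviour described above.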

In the case of connect, you can also define keys (you don't have to, though), but they are used not for joining but to control how elements are routed to the parallel instances of the operator and to scope keyed state. So, with connect, there is no built-in logic for how the elements are combined; instead, processElement1 is called for each element from stream1 and processElement2 is called for each element from stream2. If you want to perform some kind of join, you have to implement the logic yourself.
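The "implement the logic yourself" part can be sketched in plain Java as follows. Again, this is a toy model rather than Flink API: the class name, the `String` keys and values, and the in-memory maps standing in for keyed state are assumptions for illustration. Only the method names `processElement1` and `processElement2` mirror the callbacks mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a hand-written join over two connected streams; not Flink API.
class ConnectJoinSketch {
    // Stand-ins for keyed state: elements seen so far per key, one map per input.
    private final Map<String, List<String>> seen1 = new HashMap<>();
    private final Map<String, List<String>> seen2 = new HashMap<>();
    private final List<String> output = new ArrayList<>();

    // Called once per element of the first stream: match against everything
    // already buffered from the second stream, then buffer this element.
    void processElement1(String key, String value) {
        for (String other : seen2.getOrDefault(key, List.of())) {
            output.add(value + "+" + other);
        }
        seen1.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Called once per element of the second stream: the mirror image of the above.
    void processElement2(String key, String value) {
        for (String other : seen1.getOrDefault(key, List.of())) {
            output.add(other + "+" + value);
        }
        seen2.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    List<String> output() {
        return output;
    }

    public static void main(String[] args) {
        ConnectJoinSketch sketch = new ConnectJoinSketch();
        // Elements arrive interleaved, in whatever order the two streams deliver them.
        sketch.processElement1("a", "order-1");
        sketch.processElement2("b", "pay-9");
        sketch.processElement2("a", "pay-1");
        sketch.processElement1("b", "order-2");
        System.out.println(sketch.output());
    }
}
```

Note that this sketch buffers elements forever. In a real keyed CoProcessFunction, this is where timers and the onTimer callback come in: you would register a timer to expire buffered state, which is exactly the kind of control that the built-in join does not expose.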