I am looking in the context of joining two streams using Flink and would like to understand how these two streams differ and affect how Flink processes them.
As a related question, I would also like to understand how CoProcessFunction would differ from a KeyedCoProcessFunction.
A KeyedStream
is a DataStream
that has been hash partitioned, with the effect that for any given key, every stream element for that key is in the same partition. This guarantees that all messages for a key are processed by the same worker instance. Only keyed streams can use key-partitioned state and timers.
A KeyedCoProcessFunction
connects two streams that have been keyed in compatible ways -- the two streams are mapped to the same keyspace -- making it possible for the KeyedCoProcessFunction
to have keyed state that relates to both streams. For example, you might want to join a stream of customer transactions with a stream of customer updates -- joining them on the customer_id. You would implement this in Flink (if doing so at a low level) by keying both streams by the customer_id, and connecting those keyed streams with a KeyedCoProcessFunction
.
On the other hand, a CoProcessFunction
has two inputs, but with no particular relationship between those inputs.
The Flink training has tutorials covering keyed streams and connected streams, and a related exercise/example.