I have two Table Sources listen to a Kafka topic using Flink. On the generated graph, the data exchange for the join between these two tables is "hash". However, I did not found any information how the hash works (on a specific field ?) and how it can be configured ?
In the case of a Table join, the hash data exchange is based on the equality clause of the join. E.g., if you are doing
SELECT *
FROM A, B
WHERE A.id = B.id
then the stream records from both streams will be hashed on the id field. This will guarantee that all records from both A and B with the same id
value will be sent to the same downstream instance. (This is why Flink only supports joins with an equality predicate -- they are much more readily scalable.)
Internally this turns into a keyBy
from Flink's DataStream API. A Table/SQL GROUP BY
works the same way.