I have a job with parallelism 2; it gets data from a kafka topic and, after keying, it handles timers in a stateful function.
I observed that sometimes one parallelized instance gets stuck: as a result timers do not trigger until a new message arrives, moving forward the current watermark for that parallel instance.
How does Flink split data between parallel instances? Is there a metric to explore to get a quick view of how messages are split? (in percent or a count)
A part from reducing parallelism to 1, is there any other tip to solve this issue?
Thanks
With the Kafka source, it depends on the number of partitions. So setting the parallelism higher than the number of partitions will stop the watermark moving forward. In your case, as you mentioned it only gets stuck sometimes, probably one of the partitions didn't receive data for a bit which again stops the watermark.
To solve this issue, you can use withIdleness
with your watermark strategy, more details can be found in the docs.