Search code examples
apache-flinkflink-streamingflink-cep

Apache Flink 1.3.2 connectivity issue with Kafka 1.1.0


I'm using a cluster of Apache Flink 1.3.2. We're consuming Kafka messages and since upgrading the broker to 1.1.0 (from 0.10.2) we noticed this error in the log frequently:

 ERROR o.a.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase  - Async Kafka commit failed.
org.apache.kafka.clients.consumer.RetriableCommitFailedException: Offset commit failed with a retriable exception. You should retry committing offsets.
Caused by: org.apache.kafka.common.errors.DisconnectException: null

Due to this sometimes we experience missing events during processing. We use FlinkKafkaConsumer010 in the job.

Checkpointing is enabled (Interval 10 s, Timeout 1 minute, Minimum pause between checkpoints 5s, Maximum concurrent checkpoints 1. E2E duration on average is under 1s, under half a second even I'd say.) Same settings were used with Kafka 0.10.2 where we don't have this exception.

Update: We have reinstalled Kafka and now we get a warning message but still no events are read

WARN  o.a.flink.streaming.connectors.kafka.internal.Kafka09Fetcher  - Committing offsets to Kafka takes longer than the checkpoint interval. Skipping commit of previous offsets because newer complete checkpoint offsets are available. This does not compromise Flink's checkpoint integrity.

Solution

  • Turns out this was caused by some connection issues we had in AWS. The framework works well with Kafka 1.1