We're running an Apache Flink 1.3.2 cluster that consumes Kafka messages. Since upgrading the broker to 1.1.0 (from 0.10.2), we frequently see this error in the log:
ERROR o.a.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase - Async Kafka commit failed.
org.apache.kafka.clients.consumer.RetriableCommitFailedException: Offset commit failed with a retriable exception. You should retry committing offsets.
Caused by: org.apache.kafka.common.errors.DisconnectException: null
Because of this, we sometimes see missing events during processing. The job uses FlinkKafkaConsumer010.
Checkpointing is enabled (interval 10 s, timeout 1 minute, minimum pause between checkpoints 5 s, maximum concurrent checkpoints 1; end-to-end duration averages under 1 s, usually even under half a second). The same settings worked with Kafka 0.10.2, where we never saw this exception.
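For reference, here is a minimal sketch of how the checkpointing settings described above would be configured in a Flink 1.3 job, together with the FlinkKafkaConsumer010 setup. The broker address, group id, and topic name are placeholders, not taken from our actual deployment:

```java
import java.util.Properties;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KafkaJobSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(10_000);            // checkpoint interval: 10 s
        CheckpointConfig cfg = env.getCheckpointConfig();
        cfg.setCheckpointTimeout(60_000);           // timeout: 1 minute
        cfg.setMinPauseBetweenCheckpoints(5_000);   // minimum pause: 5 s
        cfg.setMaxConcurrentCheckpoints(1);         // at most one concurrent checkpoint

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092"); // placeholder address
        props.setProperty("group.id", "my-group");             // placeholder group id

        // With checkpointing enabled, the connector commits offsets to Kafka
        // on checkpoint completion; the commit itself is only informational,
        // since Flink restores from checkpointed offsets, not from Kafka.
        FlinkKafkaConsumer010<String> consumer = new FlinkKafkaConsumer010<>(
                "my-topic", new SimpleStringSchema(), props); // placeholder topic

        env.addSource(consumer).print();
        env.execute("kafka-job-sketch");
    }
}
```

Note that the async offset commit that fails in the error above is exactly this checkpoint-triggered commit, which is why the exception does not by itself break exactly-once processing.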
Update: we have reinstalled Kafka. Now we get a warning message instead, but still no events are read:
WARN o.a.flink.streaming.connectors.kafka.internal.Kafka09Fetcher - Committing offsets to Kafka takes longer than the checkpoint interval. Skipping commit of previous offsets because newer complete checkpoint offsets are available. This does not compromise Flink's checkpoint integrity.
It turns out this was caused by some connection issues we had in AWS. The framework works well with Kafka 1.1.