Search code examples
apache-kafkaintelapache-flinkflink-streamingfault-tolerance

Fault Tolerance of FlinkKafkaConsumer in HiBench


I am running some experiments to test the fault tolerance capabilities of Apache Flink. I am currently using the HiBench framework with the WordCount micro benchmark implemented for Flink.

I noticed that if I kill a TaskManager during an execution, the state of the Flink operators is recovered after the automatic "redeploy" but many (all?) tuples sent from the benchmark to Kafka are missed (stored in Kafka but not received in Flink).

It seems that after the recovery, the FlinkKafkaConsumer (the benchmark uses FlinkKafkaConsumer08) in place of start reading from the last offset read before the failure start reading from the latest available one (losing all the event sent during the failure).

Any suggestion?

Thanks!


Solution

  • The problem was with the HiBench framework itself and with the latest version of Flink.

    I had to update the version of Flink in the benchmark in order to use the "setStartFromGroupOffsets()" method in the Kafka consumer.