Tags: apache-spark, spark-structured-streaming, spark-kafka-integration

Spark Offset Management in Kafka


I am using Spark Structured Streaming (version 2.3.2). I need to read from a Kafka cluster and write into a Kerberized Kafka cluster. I want to use Kafka itself for offset checkpointing after each record is written into the Kerberized Kafka cluster.

Questions:

  1. Can we use Kafka for checkpointing to manage offsets, or do we need to use HDFS/S3 only?

Please help.


Solution

  • Can we use Kafka for checkpointing to manage offset

    No, you cannot commit offsets back to your source Kafka topic. This is described in detail here and in the official Spark Structured Streaming + Kafka Integration Guide.

    or do we need to use only HDFS/S3 only?

    Yes, the checkpoint location has to be on an HDFS-compatible file system such as HDFS or S3. This is explained in the section Recovering from Failures with Checkpointing of the Structured Streaming Programming Guide: "This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query."
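
    As a rough sketch of how this fits together, the following Scala job reads from a plain Kafka cluster and writes to a Kerberized one, with Spark storing its offsets in an HDFS checkpoint directory. All broker addresses, topic names, and paths are placeholders, and the Kerberos settings are illustrative; in Spark 2.3 the JAAS configuration is typically supplied separately via `-Djava.security.auth.login.config=...` in the driver and executor Java options.

    ```scala
    import org.apache.spark.sql.SparkSession

    object KafkaToKerberizedKafka {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("kafka-to-kerberized-kafka")
          .getOrCreate()

        // Read from the (non-Kerberized) source cluster. Spark tracks the
        // consumed offsets itself; they are NOT committed back to Kafka.
        val source = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "source-broker:9092") // placeholder
          .option("subscribe", "input-topic")                      // placeholder
          .option("startingOffsets", "earliest")
          .load()

        // Write to the Kerberized target cluster. Offsets are persisted in
        // the checkpoint directory, which must live on an HDFS-compatible
        // file system -- this is the only supported offset store.
        val query = source
          .selectExpr("CAST(key AS STRING) AS key",
                      "CAST(value AS STRING) AS value")
          .writeStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "target-broker:9093") // placeholder
          .option("kafka.security.protocol", "SASL_SSL")
          .option("kafka.sasl.kerberos.service.name", "kafka")
          .option("topic", "output-topic")                         // placeholder
          .option("checkpointLocation", "hdfs:///checkpoints/kafka-to-kafka")
          .start()

        query.awaitTermination()
      }
    }
    ```

    On restart, Spark recovers the last committed offsets from the checkpoint directory rather than from Kafka, which is why the checkpoint path must survive job restarts.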