apache-spark, hive, apache-kafka, spark-streaming, kafka-consumer-api

Kafka consumer to read data from a topic when from and until offsets are known


Can I know whether a Kafka consumer can read specific records when the from and until offsets of a topic's partitions are known?

The use case: in my Spark Streaming application a few batches are not processed (inserted into the table), and in that case I want to read only the missed data. I am storing the topic details, i.e. the partitions and offsets.

Can someone let me know if this can be achieved by reading from the topic when the offsets are known?


Solution

  • If you want to process a set of messages defined by a starting and ending offset in Spark Streaming, you can use the following code:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import scala.collection.JavaConverters._
    
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "groupId"
    )
    val offsetRanges = Array(
      OffsetRange("input", 0, 2, 4) // <-- topic name, partition number, fromOffset, untilOffset (exclusive)
    )
    
    val sparkContext: SparkContext = ??? // your existing SparkContext
    val rdd = KafkaUtils.createRDD[String, String](sparkContext, kafkaParams.asJava, offsetRanges, PreferConsistent)
    // other processing and saving
    

    More details on the Spark Streaming and Kafka integration can be found at: https://spark.apache.org/docs/2.4.0/streaming-kafka-0-10-integration.html
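
  • You also mention storing each batch's partitions and offsets yourself. A minimal sketch of that side, using the HasOffsetRanges interface of the same spark-streaming-kafka-0-10 module (the topic name "input", the 10-second batch interval, and the println "persistence" are placeholder assumptions; in a real job you would write the ranges to your table):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    
    val ssc = new StreamingContext(new SparkConf().setAppName("offset-tracking"), Seconds(10))
    // kafkaParams is the same map shown in the snippet above
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("input"), kafkaParams))
    
    stream.foreachRDD { rdd =>
      // Capture the exact offset range of this batch before processing it,
      // so a failed batch can later be replayed with KafkaUtils.createRDD.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach(o => println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}"))
      // ... process the batch and insert into the table ...
    }
    
    ssc.start()
    ssc.awaitTermination()

  • Outside Spark, a plain Kafka consumer can also read a known range by assigning the partition and seeking to the starting offset. A sketch against the kafka-clients API (topic "input", partition 0 and offsets 2 to 4 reuse the example values from above; poll(Duration) requires kafka-clients 2.0+):

    import java.time.Duration
    import java.util.Properties
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.TopicPartition
    import scala.collection.JavaConverters._
    
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("enable.auto.commit", "false") // only re-reading; don't move committed offsets
    
    val consumer = new KafkaConsumer[String, String](props)
    val tp = new TopicPartition("input", 0)
    val (fromOffset, untilOffset) = (2L, 4L) // untilOffset is exclusive, like OffsetRange
    
    consumer.assign(java.util.Arrays.asList(tp))
    consumer.seek(tp, fromOffset)
    while (consumer.position(tp) < untilOffset) {
      consumer.poll(Duration.ofMillis(500)).records(tp).asScala
        .filter(_.offset() < untilOffset)
        .foreach(r => println(s"${r.offset()}: ${r.value()}"))
    }
    consumer.close()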