Search code examples
apache-kafkaspark-streaming

Why spark streaming is running on previous topics records?


I ran the zookeeper and kafka broker but I didn't run kafka producer. I ran spark streaming code and print the nonfiltered stream here. My question is, why I'm receiving these stream of data, namely

{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}

{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}

Although I'm not running the producer? What do these messages mean bellow?

19/06/24 20:20:00 INFO JobScheduler: Finished job streaming job 1561378800000 ms.0 from job set of time 1561378800000 ms
19/06/24 20:20:00 INFO JobScheduler: Total delay: 0.028 s for time 1561378800000 ms (execution: 0.021 s)
19/06/24 20:20:00 INFO MapPartitionsRDD: Removing RDD 161 from persistence list
19/06/24 20:20:00 INFO ContextCleaner: Cleaned accumulator 1716
19/06/24 20:20:00 INFO ContextCleaner: Cleaned accumulator 1893
19/06/24 20:20:00 INFO ContextCleaner: Cleaned accumulator 1944
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
{"vehicleId":"0","lon":"0","lat":"0","ts":"0"}
...

19/06/24 20:20:00 INFO KafkaRDD: Removing RDD 160 from persistence list
19/06/24 20:20:00 INFO ContextCleaner: Cleaned accumulator 1628
19/06/24 20:20:00 INFO ContextCleaner: Cleaned accumulator 1781
19/06/24 20:20:00 INFO ContextCleaner: Cleaned accumulator 1570
19/06/24 20:20:00 INFO BlockManager: Removing RDD 161
19/06/24 20:20:00 INFO ContextCleaner: Cleaned accumulator 1808
19/06/24 20:20:00 INFO ContextCleaner: Cleaned accumulator 2020
19/06/24 20:20:00 INFO ContextCleaner: Cleaned accumulator 1624
19/06/24 20:20:00 INFO ContextCleaner: Cleaned accumulator 1918
19/06/24 20:20:00 INFO ContextCleaner: Cleaned

Solution

  • You may want to check what you've set in 'auto.offset.reset'.

    From the Spark Streaming Guide:

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092,anotherhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "use_a_separate_group_id_for_each_stream",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    

    They set the offset reset to "latest". Yours seems to be set to earliest.