apache-spark, apache-kafka, spark-structured-streaming

How to make sure that Spark Structured Streaming is processing all the data in Kafka


I developed a Spark Structured Streaming application that reads data from a Kafka topic, aggregates the data, and then outputs to S3.
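For context, a minimal sketch of such a pipeline might look like the following. The topic name, schema, bucket paths, and broker address are placeholders for illustration only, not the actual application.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    object KafkaToS3Aggregator {
      def main(args: Array[String]): Unit = {
        // Requires the spark-sql-kafka connector on the classpath.
        val spark = SparkSession.builder().appName("kafka-to-s3-aggregation").getOrCreate()
        import spark.implicits._

        // Hypothetical event schema; replace with the real payload structure.
        val schema = new StructType()
          .add("user_id", StringType)
          .add("amount", DoubleType)
          .add("event_time", TimestampType)

        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092") // assumption
          .option("subscribe", "events")                      // assumption
          .load()
          .select(from_json($"value".cast("string"), schema).as("e"))
          .select("e.*")

        // Windowed aggregation with a watermark so old state can be dropped.
        val aggregated = events
          .withWatermark("event_time", "10 minutes")
          .groupBy(window($"event_time", "5 minutes"), $"user_id")
          .agg(sum($"amount").as("total_amount"))

        aggregated.writeStream
          .format("parquet")
          .option("path", "s3a://my-bucket/aggregates/")                // assumption
          .option("checkpointLocation", "s3a://my-bucket/checkpoints/") // assumption
          .outputMode("append")
          .start()
          .awaitTermination()
      }
    }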

Now I'm trying to find the most appropriate hardware resources for the application while also minimizing costs. Since I found very little information on how to right-size a Spark cluster given the size of its input, I opted for a trial-and-error strategy: I deploy the application with minimal resources and add resources until the Spark application runs in a stable manner.

That being said, how can I make sure that the Spark application is able to process all the data in its Kafka input, and that the application is not falling behind? Is there a specific metric to look for? Job duration time vs. trigger processing time?

Thank you for your answers!


Solution

  • Track Kafka consumer lag. There should be a consumer group created for your Spark Streaming job.

    > bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 --describe --group test-consumer-group
    
      TOPIC                          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG        CONSUMER-ID                                       HOST                           CLIENT-ID
      test-foo                       0          1               3               2          consumer-1-a5d61779-4d04-4c50-a6d6-fb35d942642d   /127.0.0.1                     consumer-1
    
    

    If you have metric collection and plotting tools such as Prometheus and Grafana:

    1. Save all Kafka metrics, including Kafka consumer lag, to Prometheus/Graphite
    2. Use Grafana to query Prometheus and plot them on a dashboard

    You can also read the equivalent progress information from inside Spark itself; see the listener sketch below.
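As a complement to the external consumer-lag check, Spark Structured Streaming exposes per-batch progress through `StreamingQueryListener`. Comparing `processedRowsPerSecond` with `inputRowsPerSecond` (and the `triggerExecution` duration with your trigger interval) tells you directly whether the query is keeping up. The sketch below only logs the values; how you register the listener and forward the numbers to Prometheus/Graphite is up to you, and the function name `registerProgressListener` is just for illustration.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    // Register once per SparkSession, e.g. right after building it.
    def registerProgressListener(spark: SparkSession): Unit = {
      spark.streams.addListener(new StreamingQueryListener {
        override def onQueryStarted(event: QueryStartedEvent): Unit = ()
        override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

        override def onQueryProgress(event: QueryProgressEvent): Unit = {
          val p = event.progress
          // If processedRowsPerSecond stays below inputRowsPerSecond,
          // the query is falling behind its Kafka input.
          val triggerMs =
            Option(p.durationMs.get("triggerExecution")).map(_.longValue).getOrElse(0L)
          println(
            s"batch=${p.batchId} " +
            s"inputRows/s=${p.inputRowsPerSecond} " +
            s"processedRows/s=${p.processedRowsPerSecond} " +
            s"triggerExecutionMs=$triggerMs"
          )
          // Kafka start/end offsets per source; the gap to the topic's
          // log-end offsets is the lag reported by kafka-consumer-groups.sh.
          p.sources.foreach { s =>
            println(s"source=${s.description} startOffset=${s.startOffset} endOffset=${s.endOffset}")
          }
        }
      })
    }

If the trigger execution time is consistently longer than the trigger interval, or the processed rate stays below the input rate, the application does not have enough resources to keep up.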