Tags: apache-spark, streaming, analytics, spark-streaming, metrics

Spark Streaming Processing Time vs Total Delay vs Processing Delay


I am trying to understand what the different metrics that Spark Streaming outputs mean, and I am slightly confused: what is the difference between the Processing Time, Total Delay, and Processing Delay of the last batch?

I have looked at the Spark Streaming guide, which mentions Processing Time as a key metric for figuring out whether the system is falling behind, but other places, such as "Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark", speak about using Total Delay and Processing Delay. I have failed to find any documentation that lists all the metrics produced by Spark Streaming with an explanation of what each of them means.

I would appreciate it if someone could outline what each of these three metrics means, or point me to resources that can help me understand them.


Solution

  • Let's break down each metric. To do that, let's define a basic streaming application which reads a batch every 4 seconds from some arbitrary source and computes the classic word count (a fuller setup is sketched right after the snippet):

    inputDStream.flatMap(line => line.split(" "))
                .map(word => (word, 1))
                .reduceByKey(_ + _)
                .saveAsTextFiles("hdfs://...")
    
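    For context, here is a minimal sketch of how such an application might be wired up. The socket source and app name are illustrative assumptions only; the important part is the 4-second batch interval passed to the StreamingContext:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("WordCount")
    // The batch interval: a new batch is generated every 4 seconds
    val ssc = new StreamingContext(conf, Seconds(4))

    // Illustrative source; any receiver or direct stream behaves the same way
    val inputDStream = ssc.socketTextStream("localhost", 9999)

    inputDStream.flatMap(line => line.split(" "))
                .map(word => (word, 1))
                .reduceByKey(_ + _)
                .saveAsTextFiles("hdfs://...")

    ssc.start()
    ssc.awaitTermination()
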
    • Processing Time: The time it takes to compute a given batch across all its jobs, end to end. In our case this means a single job which starts at flatMap and ends at saveAsTextFiles, and it assumes as a prerequisite that the job has already been submitted.

    • Scheduling Delay: The time the Spark Streaming scheduler takes to submit the jobs of a batch. How is this computed? As we've said, our application reads a batch from the source every 4 seconds. Now let's assume that a given batch took 8 seconds to compute. This means that we're now 8 - 4 = 4 seconds behind, so the next batch's jobs wait 4 seconds before they are submitted, making its scheduling delay 4 seconds long.

    • Total Delay: This is Scheduling Delay + Processing Time. Following the same example, if a batch's scheduling delay is 4 seconds and that batch then takes another 8 seconds to compute, its total delay is 8 + 4 = 12 seconds long. (These values can also be read programmatically; see the listener sketch at the end of this answer.)

    A live example from a working Streaming application:

    [Screenshot: per-batch statistics from a running Streaming application]

    We see that:

    • The batch in the bottom row took 11 seconds to process, so the next batch's scheduling delay is 11 - 4 = 7 seconds.
    • If we look at the second row from the bottom, we see that scheduling delay + processing time = total delay; in that case (rounding 0.9 up to 1) 7 + 1 = 8 seconds.
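    If you want to read these three metrics programmatically rather than off the UI, they are exposed on each completed batch's BatchInfo via the StreamingListener API (note that in that API the processing time is exposed under the name processingDelay, which is likely where the "Processing Delay" wording comes from). Here is a minimal sketch; the class name and log format are my own:

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    // Logs scheduling delay, processing time and total delay (all in milliseconds)
    // for every completed batch.
    class BatchMetricsListener extends StreamingListener {
      override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
        val info = batchCompleted.batchInfo
        println(
          s"Batch ${info.batchTime}: " +
          s"scheduling delay = ${info.schedulingDelay.getOrElse(-1L)} ms, " +
          s"processing time = ${info.processingDelay.getOrElse(-1L)} ms, " +
          s"total delay = ${info.totalDelay.getOrElse(-1L)} ms")
      }
    }

    // Register it before starting the context:
    // ssc.addStreamingListener(new BatchMetricsListener)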