Tags: apache-spark, streaming, analytics, spark-streaming, metrics

Spark Streaming Processing Time vs Total Delay vs Processing Delay


I am trying to understand what the different metrics that Spark Streaming outputs mean, and I am slightly confused: what is the difference between the Processing Time, Total Delay, and Processing Delay of the last batch?

I have looked at the Spark Streaming guide, which mentions Processing Time as a key metric for figuring out whether the system is falling behind, but other places, such as "Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark", speak about using Total Delay and Processing Delay. I have failed to find any documentation that lists all the metrics produced by Spark Streaming with an explanation of what each of them means.

I would appreciate it if someone could outline what each of these three metrics means, or point me to resources that can help me understand them.


Solution

  • Let's break down each metric. To do that, let's define a basic streaming application which reads a batch every 4 seconds from some arbitrary source and computes the classic word count (a fuller setup is sketched right after the snippet):

    inputDStream.flatMap(line => line.split(" "))
                .map(word => (word, 1))
                .reduceByKey(_ + _)
                .saveAsTextFiles("hdfs://...")
    
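    For context, here is a minimal sketch of how such an application might be wired up. The socket source and app name are illustrative assumptions only; the important part is the 4-second batch interval passed to the StreamingContext:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("WordCount")
    // The batch interval: a new batch is generated every 4 seconds
    val ssc = new StreamingContext(conf, Seconds(4))

    // Illustrative source; any receiver or direct stream behaves the same way
    val inputDStream = ssc.socketTextStream("localhost", 9999)

    inputDStream.flatMap(line => line.split(" "))
                .map(word => (word, 1))
                .reduceByKey(_ + _)
                .saveAsTextFiles("hdfs://...")

    ssc.start()
    ssc.awaitTermination()
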
    • Processing Time: The time it takes to compute a given batch across all its jobs, end to end. In our case this means a single job which starts at flatMap and ends at saveAsTextFiles, and it assumes as a prerequisite that the job has already been submitted.

    • Scheduling Delay: The time the Spark Streaming scheduler takes to submit the jobs of a batch. How is this computed? As we've said, our application reads a batch from the source every 4 seconds. Now let's assume that a given batch took 8 seconds to compute. This means that we're now 8 - 4 = 4 seconds behind, so the next batch's jobs wait 4 seconds before they are submitted, making its scheduling delay 4 seconds long.

    • Total Delay: This is Scheduling Delay + Processing Time. Following the same example, if a batch's scheduling delay is 4 seconds and that batch then takes another 8 seconds to compute, its total delay is 8 + 4 = 12 seconds long. (These values can also be read programmatically; see the listener sketch at the end of this answer.)

    A live example from a working Streaming application:

    [Screenshot: per-batch statistics from a running Streaming application]

    We see that:

    • The batch in the bottom row took 11 seconds to process, so the next batch's scheduling delay is 11 - 4 = 7 seconds.
    • If we look at the second row from the bottom, we see that scheduling delay + processing time = total delay; in that case (rounding 0.9 up to 1) 7 + 1 = 8 seconds.
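    If you want to read these three metrics programmatically rather than off the UI, they are exposed on each completed batch's BatchInfo via the StreamingListener API (note that in that API the processing time is exposed under the name processingDelay, which is likely where the "Processing Delay" wording comes from). Here is a minimal sketch; the class name and log format are my own:

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    // Logs scheduling delay, processing time and total delay (all in milliseconds)
    // for every completed batch.
    class BatchMetricsListener extends StreamingListener {
      override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
        val info = batchCompleted.batchInfo
        println(
          s"Batch ${info.batchTime}: " +
          s"scheduling delay = ${info.schedulingDelay.getOrElse(-1L)} ms, " +
          s"processing time = ${info.processingDelay.getOrElse(-1L)} ms, " +
          s"total delay = ${info.totalDelay.getOrElse(-1L)} ms")
      }
    }

    // Register it before starting the context:
    // ssc.addStreamingListener(new BatchMetricsListener)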