I ran a query in databricks, and the notebook it says it took 12 seconds. Meanwhile when I navigated through the spark UI to see the SQL execution plan it reports 12104 ms, which it matches. But the phases the execution plan is composed of say that they took as long as 30 seconds for a single phase.
Here is the execution plan in tabular format. And down below it's on graphical format. If you notice on the tabular format, at the very top left corner it says Completed in 12104 ms
but then if you read the breakdown you can see that one of the *PhotonShuffleMapStage
step (which is made of operations (7)
and (8)
) reports it took 29.6s
....
So how come the entire query say it took 12 seconds when a single of its phases says 29.6 seconds?
Completed in 12104 ms
Node | # Tasks | Duration total (min, med, max) | # Rows | Est # Rows | Peak Mem total (min, med, max)
----------------------------------------------------------+---------+-----------------------------------------------------+-------------+------------+----------------------------------------------------------------
*PhotonShuffleMapStage | 4 | 1.6 s (0 ms, 0 ms, 448 ms (stage 56.0: task 186)) | 0 | - | 824.0 MiB (0.0 B, 0.0 B, 206.0 MiB (stage 56.0: task 187))
-*(1): PhotonScan parquet hive_metastore.default.user | - | 241 ms (0 ms, 0 ms, 77 ms (stage 56.0: task 188)) | 10,000,000 | - | 64.0 MiB (0.0 B, 0.0 B, 16.0 MiB (stage 56.0: task 187))
-*(2): PhotonShuffleExchangeSink | - | 1.6 s (0 ms, 0 ms, 448 ms (stage 56.0: task 186)) | 10,000,000 | - | 248.0 MiB (0.0 B, 0.0 B, 62.0 MiB (stage 56.0: task 187))
*PhotonShuffleMapStage | 8 | 10.0 s (0 ms, 1.4 s, 1.5 s (stage 60.0: task 201)) | 0 | - | 1626.0 MiB (0.0 B, 210.0 MiB, 212.0 MiB (stage 60.0: task 203))
-*(5): AQEShuffleRead | - | - | - | - | -
-*(6): PhotonShuffleExchangeSource | - | 136 ms (0 ms, 17 ms, 24 ms (stage 60.0: task 203)) | 10,000,000 | - | -
-*(11): AQEShuffleRead | - | - | - | - | -
-*(12): PhotonShuffleExchangeSource | - | 2.3 s (0 ms, 301 ms, 370 ms (stage 60.0: task 201)) | 320,000,000 | - | -
-*(13): PhotonShuffledHashJoin | - | 9.9 s (0 ms, 1.4 s, 1.5 s (stage 60.0: task 201)) | 329,951,357 | - | 602.0 MiB (0.0 B, 82.0 MiB, 84.0 MiB (stage 60.0: task 203))
-*(14): PhotonProject | - | 10.0 s (0 ms, 1.4 s, 1.5 s (stage 60.0: task 201)) | 329,951,357 | - | -
-*(15): PhotonAgg | - | 10.0 s (0 ms, 1.4 s, 1.5 s (stage 60.0: task 201)) | 8 | - | -
-*(16): PhotonShuffleExchangeSink | - | 10.0 s (0 ms, 1.4 s, 1.5 s (stage 60.0: task 201)) | 8 | - | 32.0 MiB (0.0 B, 4.0 MiB, 4.0 MiB (stage 60.0: task 200))
*PhotonShuffleMapStage | 8 | 29.6 s (0 ms, 0 ms, 4.8 s (stage 57.0: task 195)) | 0 | - | 7.9 GiB (0.0 B, 0.0 B, 1158.0 MiB (stage 57.0: task 194))
-*(7): PhotonScan parquet hive_metastore.default.revision | - | 2.9 s (0 ms, 0 ms, 474 ms (stage 57.0: task 193)) | 320,000,000 | - | 128.0 MiB (0.0 B, 0.0 B, 16.0 MiB (stage 57.0: task 190))
-*(8): PhotonShuffleExchangeSink | - | 29.6 s (0 ms, 0 ms, 4.8 s (stage 57.0: task 195)) | 320,000,000 | - | 6.8 GiB (0.0 B, 0.0 B, 1014.0 MiB (stage 57.0: task 194))
*PhotonResultStage | 1 | 0 ms | 0 | - | 128.0 MiB
-*(19): PhotonShuffleExchangeSource | - | 0 ms | 8 | 1 | -
-*(20): PhotonAgg | - | 0 ms | 1 | 1 | -
*WholeStageCodegen (1) | 1 | 0 ms | - | - | -
-*(22): ColumnarToRow | - | - | 1 | - | -
(40): AdaptiveSparkPlan | - | - | - | - | -
Lets have a look at the interesting line that seems to take way too long:
Node | # Tasks | Duration total (min, med, max) | # Rows | Est # Rows | Peak Mem total (min, med, max)
-*(8): PhotonShuffleExchangeSink | - | 29.6 s (0 ms, 0 ms, 4.8 s (stage 57.0: task 195)) | 320,000,000 | - | 6.8 GiB (0.0 B, 0.0 B, 1014.0 MiB (stage 57.0: task 194))
Indeed, this total duration is 29.6s. How can this be part of a bigger query that finished in 12s?
Well, the Duration total
value refers to the total time that all tasks took. It is a simple sum over all of those timings. But what you're not getting as information here is that these tasks ran in parallel.
Let's take a very simple example: If you have 2 tasks that ran 5s each in parallel, start to end will have taken 5s. But the total duration (across all tasks) was 10s!
So you need some information on how many CPU your cluster had (which dictates how many tasks run in parallel) to have an idea on what the start to finish time was.
Have a look at the Stages
tab in the top of your Spark UI and select stage 8 (the long running one). Check out the starting timestamps of the tasks and the ending timestamps of the tasks in there, that should clear it up :)