Tags: apache-spark, pyspark, aws-glue, aws-glue-spark, spark-ui

Is there a more systematic way to resolve a slow AWS Glue + PySpark execution stage?


I have this code snippet that I ran locally in standalone mode, using only 100 records:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)

# Read the catalog table as a DynamicFrame, convert it to a Spark DataFrame, and count the rows
glue_df = glue_context.create_dynamic_frame.from_catalog(database=db, table_name=table)
df = glue_df.toDF()
print(df.count())

The schema contains 89 columns, all of string type except for 5 columns that have an array-of-struct type. The data size is 3.1 MB.

Also, here is some info about the environment used to run the code:

  • spark.executor.cores: 2
  • spark.executor.id: driver
  • spark.driver.memory: 1000M

The problem is that I can't figure out why stage 1 took 12 minutes to finish when it only has to count 100 records. I also can't find out what the "Scan parquet" and "Exchange" tasks mean, as shown in this image: Stage 1 DAG Visualization

My question is: is there a more systematic way to understand what those tasks mean? As a beginner, I relied heavily on the Spark UI, but it doesn't give much information about the tasks it executed. I was able to find which task took the most time, but I have no idea why that is the case or how to resolve it systematically.


Solution

  • The running time of Spark code is made up of the cluster kick-off time, the DAG scheduler's optimisation time, and the time spent running the stages. In your case, the issue could be caused by the following:

    • The number of parquet files. You are querying a table, but behind the scenes Spark reads the physical parquet files, so the number of files matters. To test this easily, read the table and write it back as a single parquet file (a short sketch of both checks follows this list).
    • The number of partitions. The partition count should match the computing resources you have. In your case you have 2 cores and a small table, so it is more efficient to use just a few partitions instead of the default shuffle partition count of 200.
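
    A minimal sketch of both checks, assuming the glue_context, db, and table from the question and a hypothetical S3 path you can write a test copy to:

    glue_df = glue_context.create_dynamic_frame.from_catalog(database=db, table_name=table)
    df = glue_df.toDF()

    # How many partitions the scan produced; many tiny files usually means
    # many scan tasks even for only 3.1 MB of data.
    print(df.rdd.getNumPartitions())

    # Compact the data into a single parquet file and re-run the count
    # against the compacted copy (the output path is just an example).
    df.coalesce(1).write.mode("overwrite").parquet("s3://my-bucket/tmp/compacted/")

    # Reduce the shuffle partition count from the default 200 so the
    # Exchange step doesn't create far more tasks than 2 cores can use.
    glue_context.spark_session.conf.set("spark.sql.shuffle.partitions", "4")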

    To get more clarity on the Spark stages, use the explain function and read the DAG result. The output of this function lets you see and compare the Analyzed Logical Plan, the Optimized Logical Plan, and the Physical Plan produced by the internal optimiser.

    To find a more detailed description of the explain function, please visit this LINK.
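
    As a rough illustration, assuming the df from the question, calling explain with the extended flag prints those plans, and the "Scan parquet" and "Exchange" nodes from the Spark UI appear in the physical plan by name:

    # Print the Parsed/Analyzed/Optimized Logical Plans plus the Physical Plan.
    df.explain(True)

    # count() itself is executed as an aggregation; explaining the equivalent
    # query shows where the scan and the shuffle happen:
    df.groupBy().count().explain(True)
    #   HashAggregate(keys=[], functions=[count(1)])
    #   +- Exchange SinglePartition            <- the "Exchange" step
    #      +- HashAggregate(keys=[], functions=[partial_count(1)])
    #         +- FileScan parquet ...          <- the "Scan parquet" step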