I run a job in AWS Glue on 1 MB of data. It takes 2.5 seconds to complete.
The job uses the PySpark framework.
Going by this, on 1 GB of data the job should take around 2.5 * 1000 = 2500 seconds to complete.
But when I ran the job on 1 GB of data, it took only 20 seconds. How is this possible?
By default, a Glue job is configured to run with 10 DPUs, where each DPU has 16 GB of RAM and 4 vCPUs. So in your case, even if you are running the job with only 2 DPUs, you are still underutilising the cluster with such a small dataset.
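You can see how much parallelism the cluster actually exposes by logging a couple of Spark properties from inside the job script. A minimal sketch, assuming a standard Glue PySpark script; the S3 path is a placeholder, not something from your question:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Standard Glue boilerplate: one SparkContext wrapped by a GlueContext.
sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

# Hypothetical input path -- replace with your own dataset.
df = spark.read.parquet("s3://my-bucket/my-1gb-dataset/")

# How many tasks Spark will try to run concurrently across all executors.
print("defaultParallelism:", sc.defaultParallelism)

# How many partitions (and therefore tasks) this particular input produced.
print("input partitions:", df.rdd.getNumPartitions())
```

If the input produces fewer partitions than there are available cores, part of the cluster simply sits idle, which is what happens with a 1 MB file.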
And the execution time doesn't really scale the way you calculated; there are a lot of additional factors involved, such as fixed job-startup overhead and the fact that the work is spread across many cores in parallel. If you want to read more about planning resources for Glue, then refer to this link.
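A rough way to see why the scaling isn't linear: total runtime is closer to "fixed overhead + single-core work divided by the number of cores" than to "time per MB times number of MBs". The numbers below are assumptions chosen purely for illustration (they roughly reproduce your timings, but they are not measured Glue figures):

```python
# Toy model: runtime = fixed overhead + single-core work / number of cores.
fixed_overhead_s = 2.0        # assumed startup, planning, S3 listing, scheduling
per_gb_single_core_s = 145.0  # assumed single-core processing time per GB
cores = 2 * 4                 # e.g. 2 DPUs x 4 vCPUs

def estimated_runtime_s(size_gb):
    return fixed_overhead_s + per_gb_single_core_s * size_gb / cores

print(round(estimated_runtime_s(0.001), 1))  # ~2.0 s for 1 MB (overhead dominates)
print(round(estimated_runtime_s(1.0), 1))    # ~20.1 s for 1 GB, nowhere near 2500 s
```

For tiny inputs the fixed overhead dominates, and for larger inputs the extra work is divided across the available cores, so a 1000x increase in data does not translate into a 1000x increase in runtime.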