Tags: amazon-web-services, apache-spark, pyspark, amazon-emr

Spark job timing out when trying to save as table on AWS EMR


We have set up a dedicated cluster for our application on AWS.

This is the configuration of the core nodes (we have 2 core nodes):

m5.xlarge
4 vCores, 16 GiB memory, EBS-only storage
EBS storage: 64 GiB

Current dataset -

We are trying to run a Spark job that involves many joins and works with 80 million records, each record having 60+ fields.

Issue we are facing -

When we try to save the final DataFrame as an Athena table, it takes more than 1 hour and times out.

As we are the only ones using the cluster, what should our configuration be to ensure that we use all the cluster resources optimally?

Current configuration

Executor Memory : 2G
Dynamic Allocation Enabled : true
Number of Executor Cores : 1
Number of Executors : 8
spark.dynamicAllocation.executorIdleTimeout : 3600
spark.sql.broadcastTimeout : 36000

Solution

  • Looking at your config, a few observations:

    You are using

    m5.xlarge, which has 4 vCores and 16 GiB memory per node

    Executor config

    Number of Executor Cores : 1
    Executor Memory : 2G

    So at most 4 executors can spin up per node (one vCore each), and the memory those 4 executors need is only 8 GiB of the 16 GiB available. In the end you are not utilizing all the resources; giving each executor more cores and memory helps (see the sketch after this list).


    Also, as @Shadowtrooper said, save the data partitioned (in Parquet format if possible); it will also reduce cost when you query it in Athena. A write sketch follows after the configuration example below.
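
Below is a minimal sketch of how those resources could be used more fully on 2 core nodes of m5.xlarge, assuming roughly one vCore and a few GiB per node are left for YARN and OS overhead. The app name and exact numbers (2 instances, 3 cores, 10g heap, 2g overhead) are illustrative starting points, not prescriptive values, and the same settings can equally be passed as --conf flags to spark-submit:

    from pyspark.sql import SparkSession

    # Illustrative sizing for 2 core nodes of m5.xlarge (4 vCores, 16 GiB each),
    # leaving ~1 vCore and a few GiB per node for YARN / OS overhead.
    spark = (
        SparkSession.builder
        .appName("join-heavy-job")                           # hypothetical app name
        .config("spark.executor.instances", "2")             # one large executor per node
        .config("spark.executor.cores", "3")                 # 3 of the 4 vCores on each node
        .config("spark.executor.memory", "10g")              # heap per executor
        .config("spark.executor.memoryOverhead", "2g")       # off-heap overhead per executor
        .config("spark.dynamicAllocation.enabled", "false")  # fixed sizing on a dedicated cluster
        .getOrCreate()
    )

With a dedicated cluster and a fixed number of nodes, pinning the executor count is often simpler than dynamic allocation; if you prefer to keep dynamic allocation on, raising spark.executor.cores and spark.executor.memory alone already uses the nodes far better than 1 core / 2G per executor.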
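
And a minimal sketch of the partitioned Parquet write; the table name, partition column, and S3 path are placeholders to adapt to your schema:

    # Placeholders: the table name, partition column, and S3 path are examples only.
    (
        final_df
        .write
        .mode("overwrite")
        .format("parquet")
        .partitionBy("event_date")    # pick a column your Athena queries filter on
        .option("path", "s3://your-bucket/warehouse/final_table/")
        .saveAsTable("analytics.final_table")  # registers in the metastore (Glue Data Catalog, if configured)
    )

Athena then reads only the partitions and columns a query actually needs, which is where the cost and time savings come from.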