I have just launched Amazon Elastic MapReduce server after trying java.lang.OutofMemorySpace:Java heap space while fetching 120 million rows from database in pyspark where I have 1 master and 2 slave nodes running each having 4 cores and 8G RAM.
I am trying to load a massive dataset from MySQL database (containing approx. 120M rows). The query loads fine but when I do a df.show()
operation or when I try to perform operations on the spark dataframe I am getting errors like -
org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
Task 0 in stage 0.0 failed 1 times; aborting job
java.lang.OutOfMemoryError: GC overhead limit exceeded
My questions are -
htop
, I see that 5GB out of 8GB is already in use. Why is this?If you are running spark as standalone mode (local[*]) from master then it will only use master node.
How are you submitting spark job?
Use yarn cluster or client mode while submitting spark job to use resources efficiently.
Read more on YARN cluster vs client
Master node runs all the other services like hive, mysql, etc. Those services may taking 5GB of ram if aren’t using standalone mode.
In yarn UI (http://<master-public-dns>:8088
) you can check what other containers are running in more detail.
You can check where your spark driver and executer are spinning,
in spark UI http://<master-public-dns>:18080
.
Select your job and go to the Executor section, there you would find machine ip of each executor.
Enable ganglia in EMR OR go to CloudWatch ec2 metric to check each machine utilization.
Spark doesn’t start or terminates nodes.
If you want to scale your cluster depending upon job load, apply autoscaling policy to CORE or TASK instance group.
But at-least you need 1 CORE node always running.