Tags: apache-spark, pyspark, amazon-emr

Should slave nodes be launched/started separately on Amazon EMR server?


I have just launched an Amazon Elastic MapReduce (EMR) cluster after hitting java.lang.OutOfMemoryError: Java heap space while fetching 120 million rows from a database in PySpark. The cluster has 1 master and 2 slave nodes, each with 4 cores and 8 GB of RAM.

I am trying to load a massive dataset from a MySQL database (approx. 120 million rows); the load itself is sketched below. The query loads fine, but when I call df.show() or try to perform other operations on the Spark DataFrame, I get errors like:

  1. org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
  2. Task 0 in stage 0.0 failed 1 times; aborting job
  3. java.lang.OutOfMemoryError: GC overhead limit exceeded
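
Roughly, the load is a plain JDBC read along these lines (the host, credentials, and table name are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mysql-load").getOrCreate()

    # Single-partition JDBC read; placeholder connection details
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://<host>:3306/<database>")
          .option("dbtable", "<table>")
          .option("user", "<user>")
          .option("password", "<password>")
          .option("driver", "com.mysql.jdbc.Driver")
          .load())

    df.show()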

My questions are:

  1. When I SSH into the master node of the EMR cluster and run htop, I see that 5 GB out of 8 GB is already in use. Why is this?
  2. On the Amazon EMR portal, I can see that the master and slave servers are running. I'm not sure whether the slave servers are being used or whether it's just the master doing all the work. Do I have to launch or "start" the 2 slave nodes separately, or does Spark do that automatically? If so, how do I do it?

Solution

If you are running Spark in local mode (local[*]) from the master, it will only use the master node. How are you submitting the Spark job? Submit it in YARN cluster or client mode so the cluster's resources are used efficiently; a sample submission is sketched below. Read more on YARN cluster vs. client mode.
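
For example, a submission that runs on YARN might look like this (the script name and resource sizes are placeholders, to be tuned for your 4-core/8 GB nodes):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 2 \
      --executor-cores 3 \
      --executor-memory 5g \
      my_job.py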

The master node also runs all the other services, such as Hive, MySQL, etc. Those services may be taking the 5 GB of RAM if you aren't using standalone mode.
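
To confirm which processes are holding that memory, you can sort by memory usage on the master node:

    # Top memory consumers on the master (run over SSH)
    ps aux --sort=-%mem | head -n 15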

In the YARN UI (http://<master-public-dns>:8088) you can check in more detail which other containers are running.
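
If you prefer the command line, the same information is available from the master node via the YARN CLI:

    # Applications currently known to the ResourceManager
    yarn application -list
    # Nodes registered with YARN and their running container counts
    yarn node -list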

You can check where your Spark driver and executors are running in the Spark UI (http://<master-public-dns>:18080). Select your job and go to the Executors tab; there you will find the machine IP of each executor.
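
The history server also exposes this through a REST API, so you can query it without the browser (replace <app-id> with your application's ID):

    # List applications known to the history server
    curl http://<master-public-dns>:18080/api/v1/applications
    # Executor details (host, cores, memory) for one application
    curl http://<master-public-dns>:18080/api/v1/applications/<app-id>/executors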

Enable Ganglia on the EMR cluster, or go to the CloudWatch EC2 metrics to check each machine's utilization.
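
With the AWS CLI, for example, you could pull CPU utilization for one instance (the instance ID and time window are placeholders):

    aws cloudwatch get-metric-statistics \
      --namespace AWS/EC2 \
      --metric-name CPUUtilization \
      --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
      --start-time 2019-06-01T00:00:00Z \
      --end-time 2019-06-01T01:00:00Z \
      --period 300 \
      --statistics Average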

Spark does not start or terminate nodes. If you want to scale your cluster with job load, attach an autoscaling policy to the CORE or TASK instance group; a sketch is shown below. At a minimum, you always need 1 CORE node running.
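
As a sketch, assuming the cluster and instance-group IDs below are placeholders and policy.json holds a valid EMR autoscaling policy document:

    aws emr put-auto-scaling-policy \
      --cluster-id j-XXXXXXXXXXXX \
      --instance-group-id ig-XXXXXXXXXXXX \
      --auto-scaling-policy file://policy.json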