hadoop, apache-spark, apache-spark-sql, mapr

Configuring Executor and Driver memory in Spark-on-Yarn


I am confused about configuring executor and driver memory in Spark 1.5.2.

My environment settings are as below:

3 Node MAPR Cluster - Each Node: Memory 256G, 16 CPU 
Hadoop 2.7.0 
Spark 1.5.2 - Spark-on-Yarn

Input data information:

460 GB Parquet-format table from Hive.

I'm using spark-sql to query the Hive context with Spark-on-YARN, but it's a lot slower than Hive, and I'm not sure the memory configuration for Spark is right.

These are my configs:

    export SPARK_DAEMON_MEMORY=1g
    export SPARK_WORKER_MEMORY=88g

    spark.executor.memory              2g
    spark.logConf                      true
    spark.eventLog.dir                 maprfs:///apps/spark
    spark.eventLog.enabled             true
    spark.serializer                   org.apache.spark.serializer.KryoSerializer
    spark.driver.memory                5g
    spark.kryoserializer.buffer.max    1024m
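
For reference, here is a minimal sketch of how the same settings can be passed when launching the Spark SQL CLI on YARN; the table name and query are placeholders, and the flags are standard spark-submit options:

    # Spark SQL CLI in YARN client mode with the memory settings above
    # (the table name and query are placeholders)
    bin/spark-sql \
      --master yarn-client \
      --driver-memory 5g \
      --executor-memory 2g \
      --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
      --conf spark.kryoserializer.buffer.max=1024m \
      -e "SELECT count(*) FROM my_parquet_table"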

How can I avoid Spark's java.lang.OutOfMemoryError: Java heap space and GC overhead limit exceeded exceptions?

I'd really appreciate your assistance with this!


Solution

  • At first glance, you are running out of executor memory. I would suggest increasing it.

    Note that SPARK_WORKER_MEMORY is only used in standalone mode. SPARK_EXECUTOR_MEMORY is used in YARN mode.

    If you are not running anything else on the cluster, you could try the following config (a command-line sketch follows at the end of this answer):

    spark.executor.memory     16g
    spark.executor.cores      1
    spark.executor.instances  40
    # make spark.driver.memory bigger if the expected final result dataset is larger
    spark.driver.memory       5g
    

    I do not recommend setting a very large executor memory, because that typically increases GC time. Another thing I notice is that those nodes are memory-optimized; think twice about whether that fits your case.
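
    As a minimal sketch, here is one way to try that sizing from the command line (the query is a placeholder; --num-executors corresponds to spark.executor.instances, and on YARN --executor-memory is the command-line equivalent of spark.executor.memory / SPARK_EXECUTOR_MEMORY):

    # sketch only: Spark SQL CLI on YARN with the sizing suggested above
    bin/spark-sql \
      --master yarn-client \
      --driver-memory 5g \
      --executor-memory 16g \
      --executor-cores 1 \
      --num-executors 40 \
      -e "SELECT count(*) FROM your_hive_table"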