Tags: scala, apache-spark, apache-spark-standalone

Simple Spark job fails due to GC overhead limit


I've created a standalone Spark (2.1.1) cluster on my local machines with 9 cores / 80 GB per machine (27 cores / 240 GB RAM in total).

I've got a sample Spark job that sums all the numbers from 1 to x. This is the code:

package com.example

import org.apache.spark.sql.SparkSession

object ExampleMain {

    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder
          .master("spark://192.168.1.2:7077")
          .config("spark.driver.maxResultSize" ,"3g")
          .appName("ExampleApp")
          .getOrCreate()
      val sc = spark.sparkContext
      val rdd = sc.parallelize(List.range(1, 1000))
      val sum = rdd.reduce((a, b) => a + b)
      println(sum)
      done
    }

    def done = {
      println("\n\n")
      println("-------- DONE --------")
    }
}
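
For context, the jar submitted further down is named example_2.11-1.0.jar, which suggests an sbt build roughly like the following; the question doesn't include the actual build file, so the names and versions here are inferred from the jar name and Spark 2.1.1:

// Hypothetical build.sbt; project name, version and Scala version are inferred from example_2.11-1.0.jar
name := "example"

version := "1.0"

scalaVersion := "2.11.8"

// spark-sql provides SparkSession; "provided" because the standalone cluster supplies Spark at runtime
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.1" % "provided"

Running sbt package would then produce target/scala-2.11/example_2.11-1.0.jar, which is what the spark-submit command below points at.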

When running the above code I get results after a few seconds, so I cranked the code up to sum all the numbers from 1 to 1B (1,000,000,000), and then I get a GC overhead limit exceeded error.
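
The question doesn't show the modified line, but presumably the change was something along these lines, which is the pattern the Solution below points to; it materializes the whole range as a List on the driver before parallelizing:

      // Hypothetical modified line: builds roughly a billion boxed Ints on the driver first
      val rdd = sc.parallelize(List.range(1, 1000000000))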

I read that Spark should spill to disk if there isn't enough memory. I've tried to play with my cluster configuration, but that didn't help:

Driver memory = 6G
Number of workers = 24
Cores per worker = 1
Memory per worker = 10G

I'm not a developer and have no knowledge of Scala, but I would like to find a way to run this code without GC issues.

Per @philantrovert's request, I'm adding my spark-submit command:

/opt/spark-2.1.1/bin/spark-submit \
--class "com.example.ExampleMain" \
--master spark://192.168.1.2:6066 \
--deploy-mode cluster \
/mnt/spark-share/example_2.11-1.0.jar
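
For reference, the driver and executor resources listed above can also be set directly as spark-submit flags rather than in the config files; a sketch using the same values (the flag names are standard spark-submit options, the rest mirrors the command above):

/opt/spark-2.1.1/bin/spark-submit \
--class "com.example.ExampleMain" \
--master spark://192.168.1.2:6066 \
--deploy-mode cluster \
--driver-memory 6g \
--executor-memory 10g \
--executor-cores 1 \
--total-executor-cores 24 \
/mnt/spark-share/example_2.11-1.0.jar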

In addition, my spark/conf settings are as follows:

  • the slaves file contains the 3 IP addresses of my nodes (including the master)
  • spark-defaults.conf contains:
    • spark.master spark://192.168.1.2:7077
    • spark.driver.memory 10g
  • spark-env.sh contains:
    • SPARK_LOCAL_DIRS= shared folder among all nodes
    • SPARK_EXECUTOR_MEMORY=10G
    • SPARK_DRIVER_MEMORY=10G
    • SPARK_WORKER_CORES=1
    • SPARK_WORKER_MEMORY=10G
    • SPARK_WORKER_INSTANCES=8
    • SPARK_WORKER_DIR= shared folder among all nodes
    • SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"

Thanks


Solution

  • I suppose the problem is that you create a List with 1 billion entries on the driver, which is a huge data structure (~4 GB). There is a more efficient way to programmatically create a Dataset/RDD:

    val rdd = spark.range(1000000000L).rdd
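
    Applied to the original program, that is the only line that needs to change; a minimal sketch of the revised main body, assuming everything else stays as posted:

        // spark.range generates the numbers lazily on the executors instead of materializing a List on the driver
        val rdd = spark.range(1000000000L).rdd   // 0 until 1,000,000,000 as an RDD[Long]
        val sum = rdd.reduce((a, b) => a + b)    // Long arithmetic: the result (~5 * 10^17) would overflow an Int
        println(sum)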