Tags: apache-spark, jvm, sparklyr

Spark Memory Issues with sparklyr


I'm having some strange problems with Spark running via sparklyr.

I'm currently on an R production server, connecting to my Spark cluster in client mode via spark://<my server>:7077 and then pulling data from an MS SQL Server database.
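For context, the connection and read look roughly like this (the master URL and database details are placeholders, not my real values, and the SQL Server JDBC driver jar is assumed to already be on the Spark classpath):

    library(sparklyr)

    conf <- spark_config()

    # Connect to the standalone cluster in client mode
    sc <- spark_connect(
      master = "spark://<my server>:7077",
      config = conf
    )

    # Pull a table from MS SQL Server over JDBC
    my_tbl <- spark_read_jdbc(
      sc,
      name = "my_table",
      options = list(
        url      = "jdbc:sqlserver://<sql server>:1433;databaseName=<db>",
        driver   = "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        dbtable  = "dbo.my_table",
        user     = "<user>",
        password = "<password>"
      )
    )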

I was able to do this with no issues until recently, when I was given a bigger cluster and started having memory problems.

First I was getting inexplicable 'out of memory' errors during my processing. This happened a few times, and then I started getting 'Out of memory: unable to create new thread' errors. I checked the number of threads I was using against the per-user maximum on both the R production server and the Spark server, and I was nowhere near the limit.

I restarted my master node and am now getting:

# There is insufficient memory for the Java Runtime Environment to continue.
# Cannot create GC thread. Out of system resources.

What the heck is going on??

Here are my specs:
- Spark Standalone, running as the root user
- Spark version 2.2.1
- sparklyr version 0.6.2
- Red Hat Linux


Solution

  • I figured this out by chance. It turns out that when you run operations against an external Spark cluster in client mode, Spark still runs a driver process locally on the client machine. I believe that local driver did not have enough memory allocated, and that was causing the errors. My fix was simple:

    Instead of allocating memory via:

    spark_conf <- spark_config()
    spark_conf$`spark.driver.memory` <- "8G"
    spark_conf$`spark.executor.memory` <- "12G"
    

    I used:

    # sparklyr passes these to spark-submit as --driver-memory / --executor-memory
    spark_conf <- spark_config()
    spark_conf$`sparklyr.shell.driver-memory` <- "8G"
    spark_conf$`sparklyr.shell.executor-memory` <- "12G"
    

    The former only sets the resources in the Spark context on the cluster. The latter sets them in the Spark context and for the rest of the sparklyr application, including the locally launched driver, because the `sparklyr.shell.*` options are passed to `spark-submit` as command-line arguments.
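
    Putting it together, a minimal sketch of the working setup (memory sizes as above, master URL is a placeholder) looks like this:

    library(sparklyr)

    conf <- spark_config()

    # Passed to spark-submit as --driver-memory / --executor-memory,
    # so the locally launched driver JVM actually gets the larger heap
    conf$`sparklyr.shell.driver-memory` <- "8G"
    conf$`sparklyr.shell.executor-memory` <- "12G"

    sc <- spark_connect(
      master = "spark://<my server>:7077",
      config = conf
    )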