I am using sparklyr to run a Spark application in local mode on a virtual machine with 244GB of RAM. In my code I use spark_read_csv() to read in ~50MB of CSVs from one folder and then ~1.5GB of CSVs from a second folder. My issue is that the application throws an error when trying to read in the second folder.
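For reference, the reads look roughly like this (a sketch; the paths and table names are placeholders, not my actual code):

library(sparklyr)

sc <- spark_connect(master = "local")

# ~50MB of CSVs: reads fine under the default driver memory
small_tbl <- spark_read_csv(sc, name = "small_tbl", path = "path/to/small_folder")

# ~1.5GB of CSVs: this read fails with the default driver JVM memory
large_tbl <- spark_read_csv(sc, name = "large_tbl", path = "path/to/large_folder")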
As I understand it, the issue is that the default RAM available to the driver JVM is 512MB, which is too small for the second folder (in local mode all operations are run within the driver JVM, as described in How to set Apache Spark Executor memory). So I need to increase the spark.driver.memory parameter to something larger.
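A quick way to check the value actually in effect, from R, is to query the SparkConf directly via sparklyr's invoke() API (a sketch; the "512m" fallback just mirrors the default mentioned above):

conf <- invoke(spark_context(sc), "getConf")
invoke(conf, "get", "spark.driver.memory", "512m")  # returns the configured value, or the fallback if unset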
The issue is that I cannot set this parameter through the normal methods described in the sparklyr documentation (i.e. via spark_config(), the config.yml file, or the spark-defaults.conf file):
in local mode, by the time you run spark-submit, a JVM has already been launched with the default memory settings, so setting "spark.driver.memory" in your conf won't actually do anything for you. Instead, you need to run spark-submit as follows:
bin/spark-submit --driver-memory 2g --class your.class.here app.jar
(from How to set Apache Spark Executor memory).
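For completeness, this is the standard spark_config() approach that has no effect in local mode (a sketch; the 5G value is illustrative):

config <- spark_config()
config$spark.driver.memory <- "5G"   # ignored in local mode: the driver JVM has already launched
sc <- spark_connect(master = "local", config = config)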
I thought I could replicate the bin/spark-submit command above by adding the sparklyr.shell.driver-memory option to the config.yml. As stated in the sparklyr documentation, sparklyr.shell* options are command-line parameters that get passed to spark-submit, i.e. adding sparklyr.shell.driver-memory: 5G to the config.yml file should be equivalent to running bin/spark-submit --driver-memory 5G.
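For reference, the config.yml entry would look something like this (a sketch; the default: top-level section is the standard convention for sparklyr config files):

default:
  sparklyr.shell.driver-memory: 5G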
I have now tried all of the above options and none of them change driver memory in the Spark application (which I check by looking at the 'Executors' tab of the Spark UI).
So how can I change driver memory when running Spark in local mode via Sparklyr?
Thanks for the suggestions @Aydin K. Ultimately I was able to configure driver memory by first updating Java to 64-bit (allowing utilisation of >4GB of RAM in the JVMs), then using the sparklyr.shell* parameters within the spark_config() object:
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- '30G'    # passed to spark-submit as --driver-memory
config$`sparklyr.shell.executor-memory` <- '30G'  # passed to spark-submit as --executor-memory
sc <- spark_connect(master='local', version='2.0.1', config=config)
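To confirm the setting has taken effect, you can open the Spark UI straight from R with sparklyr's spark_web() and check the 'Executors' tab:

spark_web(sc)   # opens the Spark UI in a browser; the driver's memory should now reflect the new setting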