I've finally managed to set up my stack to use RStudio to connect to a standalone Spark cluster (with file storage in Cassandra) via sparklyr.
The only issue I still haven't been able to resolve is how to get my sparklyr connection to utilise all of the available worker nodes in the cluster (there are 6 in total). Every time I connect, the Executor Summary page shows that only 2 workers are being used by the sparklyr connection (with 1 executor on each node).
I've tried playing around with the config.yml file for the spark_connect call, including setting spark.executor.instances: 6 and spark.num.executors: 6, but that makes no difference. Is there another setting I can use to get sparklyr to use all the nodes? Can I somehow pass a list of all the worker IP addresses to spark_connect so that it connects to them all?
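For context, here is roughly what my config and connection code look like. This is only a sketch: the master URL is a placeholder for my actual master address, and the config.yml just sits in the project's working directory, where spark_connect picks it up by default.

    # config.yml (the settings I've been experimenting with)
    default:
      spark.executor.instances: 6
      spark.num.executors: 6

    # R: connect to the standalone cluster via sparklyr
    library(sparklyr)
    sc <- spark_connect(
      master  = "spark://<master-ip>:7077",   # placeholder for my master URL
      version = "2.0.0",
      config  = spark_config(file = "config.yml")
    )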
My setup is as follows: RStudio 1.0.136, sparklyr 0.5.3-9000, Spark (on the cluster and locally) 2.0.0.
Finally solved it! It was so simple and obvious I cannot believe I missed it.
The config file (spark-defaults.conf) had the settings:
spark.executor.cores: 5
spark.cores.max: 12
That of course meant it could not start more than 2 (5-core) executors, since the maximum number of cores the whole application was allowed to use was 12 (floor(12 / 5) = 2 executors).
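Once I raised spark.cores.max so that six 5-core executors could fit, all the workers were picked up. A sketch of the corrected settings, assuming each of the 6 workers can actually provide 5 cores (so 30 cores in total; adjust the numbers to your hardware):

    # spark-defaults.conf (example values after the fix)
    spark.executor.cores: 5
    spark.cores.max: 30    # 30 / 5 = 6 executors, i.e. one per worker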