apache-spark, rstudio, sparklyr

SparklyR connection to standalone spark cluster only connecting to 2/6 workers


I've finally managed to set up my stack to use RStudio to connect to a standalone Spark cluster (with file storage in CassandraDB) via sparklyR.

The only issue I still haven't been able to resolve is how to get my sparklyR connection to utilise all the available worker nodes on the cluster (there are 6 in total). Every time I connect, the Executor Summary page shows that only 2 workers are being utilised by the sparklyR connection (with 1 executor on each node).

I've tried playing around with the config.yml file for the spark_connect call, including setting spark.executor.instances: 6 and spark.num.executors: 6, but that doesn't make a difference. Is there another setting I can use to get sparklyR to use all the nodes? Can I somehow pass a list of all the worker IP addresses to spark_connect so that it connects to them all?
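
For reference, the same properties from config.yml can equivalently be passed through spark_config(); here is a minimal sketch of that (the master URL is a stand-in for my cluster's, and the property names are the ones mentioned above):

    library(sparklyr)

    conf <- spark_config()
    conf$spark.executor.instances <- 6   # no effect so far
    conf$spark.num.executors      <- 6   # no effect either
    sc <- spark_connect(master = "spark://master-host:7077", config = conf)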

My setup is as follows: RStudio: 1.0.136, sparklyR: 0.5.3-9000, Spark version (on cluster & locally): 2.0.0.


Solution

  • Finally solved it! It was so simple and obvious I cannot believe I missed it.

    The config file (spark-defaults.conf) had these settings:

    spark.executor.cores: 5
    spark.cores.max: 12
    

    That of course meant it could not start more than two 5-core executors, since the entire application was capped at 12 cores (12 ÷ 5 = 2 whole executors).
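
    Raising the cap so that six 5-core executors fit (6 × 5 = 30 cores) is what lets all the workers get used. The same limits can also be set per-connection from sparklyr rather than in spark-defaults.conf; a minimal sketch (the master URL is a stand-in for my cluster's, and the numbers assume 5 cores per worker):

    library(sparklyr)

    conf <- spark_config()
    conf$spark.executor.cores <- 5    # cores per executor
    conf$spark.cores.max      <- 30   # 6 workers x 5 cores -> one executor per node
    sc <- spark_connect(master = "spark://master-host:7077", config = conf)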