Tags: r, apache-spark, sparkr

How do I update a Spark setting in SparkR?


I am trying to pull a very large dataset from a database into my Databricks cluster using SparkR so that I can run some R functions on it. However, even though the cluster is definitely large enough, I am hitting this error:

Total size of serialized results of 66 tasks (4.0 GiB) is bigger than spark.driver.maxResultSize (4.0 GiB)

(Yes, possibly the code should be refactored, but getting it working at all is the important thing right now.)

According to the Spark documentation, this setting can be set to "0" to make it unlimited. However, I can't find any information on how to do this in R (there is plenty for Python!).

How do I update this setting to uncap the value?


Solution

  • When using SparkR, you can set the Spark configuration using the sparkConfig parameter of the SparkR::sparkR.session() function.

    Meanwhile, your current Spark configuration is available from the SparkR::sparkR.conf() function, which returns it as a named list.

    So, in this case, you want to reuse the existing configuration in its entirety and change only the one parameter that needs updating.

    The code below should:

    • Take a backup of the default config
    • Create a copy that can be updated
    • Update the one list item that needs tweaking
    • Apply the updated config to the session.

    I hope this helps!

    library(SparkR)
    # Take a backup of the current config (returned as a named list)
    orig_spark_conf <- sparkR.conf()
    # Create a copy and uncap the driver result size ("0" means unlimited)
    updated_spark_conf <- orig_spark_conf
    updated_spark_conf$spark.driver.maxResultSize <- "0"
    # Apply the updated config to the session
    sparkR.session(sparkConfig = updated_spark_conf)
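
    If you want to double-check that the new value is active, SparkR::sparkR.conf() also accepts a single key name, so a quick sanity check (assuming the session picked up the updated config) could be:

    # Look up one setting by name; this should now return "0"
    sparkR.conf("spark.driver.maxResultSize")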