
spark.conf.set with SparkR


I have a Databricks cluster running on Azure and want to read and write data from Azure Data Lake Storage using SparkR / sparklyr. I have already set up both resources.

Now I have to provide the Spark environment with the configuration it needs to authenticate against the Data Lake Storage.

Setting the configs using the PySpark API works:

    spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
    spark.conf.set("dfs.adls.oauth2.client.id", "****")
    spark.conf.set("dfs.adls.oauth2.credential", "****")
    spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/****/oauth2/token")

Ultimately I want to use SparkR / sparklyr, but I couldn't figure out where the spark.conf.set calls belong there. I would have guessed something like:

    sparkR.session(
    sparkConfig = list(spark.driver.memory = "2g",
    spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential"),
    spark.conf.set("dfs.adls.oauth2.client.id", "****"),
    spark.conf.set("dfs.adls.oauth2.credential", "****"),
    spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/****/oauth2/token")
    ))

Would be awesome if one of the experts using the SparkR API could help me out here. Thanks!

EDIT: The answer by user10791349 is correct and it works. Another solution, which is considered best practice, is mounting the external data source. Mounting is currently only possible from Scala or Python, but the mounted data source is afterwards available through the SparkR API.
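
For anyone who wants to try the mounting route, here is a rough sketch of what it could look like in a Databricks Python notebook. It assumes the `dbutils` object that Databricks notebooks predefine. The helper names `adls_oauth_configs` and `mount_data_lake`, the mount point `/mnt/datalake`, and the angle-bracket placeholders are my own hypothetical choices, not part of the original setup. The configuration keys are the ones from the question.

```python
def adls_oauth_configs(client_id, credential, tenant_id):
    """Build the OAuth settings from the question as a plain dict."""
    return {
        "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
        "dfs.adls.oauth2.client.id": client_id,
        "dfs.adls.oauth2.credential": credential,
        "dfs.adls.oauth2.refresh.url":
            "https://login.microsoftonline.com/%s/oauth2/token" % tenant_id,
    }


def mount_data_lake(dbutils, source, mount_point, configs):
    """Mount the store unless the mount point already exists.

    Returns True if a new mount was created, False if it was skipped.
    """
    if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        return False
    dbutils.fs.mount(source=source, mount_point=mount_point,
                     extra_configs=configs)
    return True


# In a Databricks Python cell (all placeholders are hypothetical):
# configs = adls_oauth_configs("<client-id>", "<credential>", "<tenant-id>")
# mount_data_lake(dbutils, "adl://<store-name>.azuredatalakestore.net/",
#                 "/mnt/datalake", configs)
```

Once mounted, the path under `/mnt/datalake` should be readable from an R cell with the usual SparkR readers, without passing any credentials on the R side.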


Solution

  • sparkConfig should be a "named list of Spark configuration to set on worker nodes", so the right format is:

    sparkR.session(
      # ... all other options go here
      sparkConfig = list(
        spark.driver.memory = "2g",
        dfs.adls.oauth2.access.token.provider.type = "ClientCredential",
        dfs.adls.oauth2.client.id = "****",
        dfs.adls.oauth2.credential = "****",
        dfs.adls.oauth2.refresh.url = "https://login.microsoftonline.com/****/oauth2/token"
      )
    )
    

    Remember that many configuration options are recognized only if there is no active session.