spark-structured-streaming, azure-databricks

How to specify a cluster init script for a Spark job


My job needs some init scripts to be executed on the cluster. At present I am using the "Existing Interactive Cluster" option when creating the job and have specified the init script on that cluster, but this gets billed at the higher "Data Analytics" workload rate.

Is there an option to choose "New Automated Cluster" on the job creation page and still have the init scripts executed on the new cluster? I am not sure whether a global init script is recommended here, since not all jobs need those init scripts, only a specific category of jobs does.


Solution

  • To fine-tune Spark jobs, you can provide custom Spark configuration properties in a cluster configuration.
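
    For example, a property can be entered directly in a cluster's Spark config (under Advanced Options) as a space-separated key-value pair. A minimal sketch for the property used in the script below might look like:

      spark.sql.sources.partitionOverwriteMode DYNAMIC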

    To set Spark properties for all clusters, create a global init script:

    %scala
    // Write a global init script to DBFS; at cluster startup it generates a
    // driver-side Spark defaults conf file containing the desired property.
    dbutils.fs.put("dbfs:/databricks/init/set_spark_params.sh","""
      |#!/bin/bash
      |
      |cat << 'EOF' > /databricks/driver/conf/00-custom-spark-driver-defaults.conf
      |[driver] {
      |  "spark.sql.sources.partitionOverwriteMode" = "DYNAMIC"
      |}
      |EOF
      """.stripMargin, true)
    

    Reference: "Spark Configuration".
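
    After a cluster is (re)started so that the script runs at startup, the property should be visible from a notebook. A quick check along these lines (just a sketch, assuming the generated conf file was picked up) can confirm it:

    %scala
    // Should print DYNAMIC once the init script has run on cluster startup.
    println(spark.conf.get("spark.sql.sources.partitionOverwriteMode"))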

    Hope this helps.


    If this answers your query, do click "Mark as Answer" and "Up-Vote" for the same. And, if you have any further query, do let us know.