Install Maven Library on (ADF) Databricks Job Cluster


I tried running /databricks/spark/bin/spark-shell --packages com.crealytics:spark-excel_2.13:3.4.1_0.19.0 in my init script, but it fails with:

    Error: Could not find or load main class org.apache.spark.launcher.Main
    /databricks/spark/bin/spark-class: line 101: CMD: bad array subscript

I also tried setting .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.15.0") when building my SparkSession, as shown below, but the setting appears to be ignored.

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession
    .builder
    .appName("oms-xml-streaming")
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.15.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.databricks.delta.autoCompact.enabled", True)
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

Workspace libraries have been deprecated, so I can't download the JAR to my workspace and copy it to /databricks/jars/ either.

Any ideas?


Solution

  • In Azure Data Factory, libraries for Databricks are specified at the task level, not at the linked service level. Create a task (notebook/jar/python), and you'll then be able to specify its libraries in the "Settings" tab of the task properties.

    If you're using a connection to an existing cluster, the task-level setting won't help: you need to install the libraries on that cluster itself.
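
For reference, here is a rough sketch of what the two options look like as JSON, built in Python so the shapes are easy to inspect. The first function mirrors what the "Settings" tab writes into a Databricks Notebook activity (a libraries list under typeProperties); the second builds a request body for the Databricks Libraries API (POST /api/2.0/libraries/install), which is one way to install onto an existing cluster. The activity name, notebook path, linked service name and cluster ID are placeholders I made up; only the Maven coordinate comes from the question. Nothing is sent anywhere, so treat this as an illustration rather than a ready-made deployment script.

```python
import json

# Maven coordinate from the question. The _2.12 suffix should match the
# Scala version of the Databricks runtime on the cluster.
MAVEN_COORDINATES = ["com.databricks:spark-xml_2.12:0.15.0"]


def maven_libraries(coordinates):
    """Turn Maven coordinates into the library entries both payloads use."""
    return [{"maven": {"coordinates": c}} for c in coordinates]


def notebook_activity(name, notebook_path, linked_service, coordinates):
    """Build an ADF Databricks Notebook activity with libraries attached.

    The libraries sit on the activity (typeProperties), not on the
    linked service.
    """
    return {
        "name": name,
        "type": "DatabricksNotebook",
        "linkedServiceName": {
            "referenceName": linked_service,
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "notebookPath": notebook_path,
            "libraries": maven_libraries(coordinates),
        },
    }


def install_request(cluster_id, coordinates):
    """Build the body for POST /api/2.0/libraries/install (existing cluster)."""
    return {
        "cluster_id": cluster_id,
        "libraries": maven_libraries(coordinates),
    }


# Hypothetical names, for illustration only.
activity = notebook_activity(
    name="RunXmlStreaming",
    notebook_path="/Shared/oms-xml-streaming",
    linked_service="AzureDatabricksLinkedService",
    coordinates=MAVEN_COORDINATES,
)
payload = install_request("<your-cluster-id>", MAVEN_COORDINATES)

print(json.dumps(activity, indent=2))
print(json.dumps(payload, indent=2))
```

With a job cluster, the activity-level libraries are installed each time the cluster is created for the run, so there is no need for an init script or spark.jars.packages.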