azure-data-factory, azure-databricks

Azure Data Factory Databricks linked service for new job cluster init script in Unity Catalog


I have a problem creating a linked service in ADF for a Databricks job cluster. I am trying to add the details for an init script that is located in a Volume in Unity Catalog:

"newClusterInitScripts": [
            "/Volumes/myunitycatalog/framework_config/external_libraries/init_script/sqlserver-odbc-install.sh"
        ]

But when I run the pipeline and it triggers a Databricks notebook, I get an error saying:

databricks_error_message:RESOURCE_DOES_NOT_EXIST: No file found at /Volumes/myunitycatalog/framework_config/external_libraries/init_script/sqlserver-odbc-install.sh.. 

However, in my Databricks workspace I can use this same path as an init script on a shared cluster, so what am I doing wrong in the linked service configuration?
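
For context, the relevant part of my linked service definition looks roughly like this (the cluster version, node type and worker count here are just placeholders; the newClusterInitScripts entry is the piece in question):

"typeProperties": {
    "domain": "https://<xxxxx>.azuredatabricks.net",
    "accessToken": "<PAT token>",
    "newClusterVersion": "15.4.x-scala2.12",
    "newClusterNodeType": "Standard_D4ds_v5",
    "newClusterNumOfWorker": "2",
    "newClusterInitScripts": [
        "/Volumes/myunitycatalog/framework_config/external_libraries/init_script/sqlserver-odbc-install.sh"
    ]
}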


Solution

  • You can try the workaround below to achieve your requirement.

    You can use REST API calls to create a cluster with the required configuration, run the notebook on it, and then delete the cluster after the notebook execution.

    First, use a Web activity in ADF and call this REST API to create the cluster, as shown below.

    URL : https://<xxxxx>.azuredatabricks.net/api/2.1/clusters/create
    Method : POST
    Body : 
    
    {
        "num_workers": null,
        "autoscale": {
            "min_workers": 3,
            "max_workers": 8
        },
        "cluster_name": "C1",
        "spark_version": "15.4.x-scala2.12",
        "spark_conf": {},
        "node_type_id": "Standard_D4ds_v5",
        "custom_tags": {},
        "spark_env_vars": {
            "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
        },
        "autotermination_minutes": 120,
        "init_scripts": [
        {
          "volumes": {
            "destination": "/Volumes/rakeshprimadb/rakeshschema/rakeshvolume/install_packages.sh"
          }
        }
        ]
    }
    

    Pass your init script path in the body JSON and use a PAT token for authentication.

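    For reference, the authentication for this Web activity can be supplied through the request headers; a minimal sketch, where the token value is a placeholder:

    Headers : 
    Authorization : Bearer <your-databricks-PAT>
    Content-Type : application/json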

    This will create the cluster, but cluster creation takes some time, so follow the web activity with a Wait activity of around 300 to 500 seconds. The web activity will return the created cluster_id, which you can use to call the required notebook.
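
    For context, the create call returns a small JSON payload along the lines of the sketch below (the id value is illustrative); this is where the later expressions read the cluster_id from.

    {
        "cluster_id": "0123-456789-abcde123"
    }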

    In the Databricks notebook linked service, select the existing interactive cluster option and use a linked service parameter @linkedService().cluster_id to specify the cluster id dynamically, as sketched below.
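
    A minimal sketch of how that parameterized linked service JSON could look (the linked service name is an example, and in practice you would usually reference the PAT from Key Vault rather than inline):

    {
        "name": "AzureDatabricksExistingCluster",
        "properties": {
            "type": "AzureDatabricks",
            "parameters": {
                "cluster_id": {
                    "type": "String"
                }
            },
            "typeProperties": {
                "domain": "https://<xxxxx>.azuredatabricks.net",
                "accessToken": {
                    "type": "SecureString",
                    "value": "<your-databricks-PAT>"
                },
                "existingClusterId": "@linkedService().cluster_id"
            }
        }
    }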


    Now, add this linked service to the Notebook activity in the pipeline and pass the cluster_id from the web activity output to this parameter using the expression @activity('Web1').output.cluster_id, along the lines of the sketch below.
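
    A rough sketch of how that Notebook activity's JSON could look, assuming the linked service from the sketch above and a placeholder notebook path:

    {
        "name": "RunNotebook",
        "type": "DatabricksNotebook",
        "typeProperties": {
            "notebookPath": "/Users/<user>/my-notebook"
        },
        "linkedServiceName": {
            "referenceName": "AzureDatabricksExistingCluster",
            "type": "LinkedServiceReference",
            "parameters": {
                "cluster_id": {
                    "value": "@activity('Web1').output.cluster_id",
                    "type": "Expression"
                }
            }
        }
    }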


    Provide your notebook path, and then add another web activity to delete the cluster.

    Use this REST API to delete the cluster. You need to pass the cluster_id in the body of the request.

    URL : https://adb-<xxxxx>.azuredatabricks.net/api/2.1/clusters/permanent-delete
    Method : POST
    Body : @json(concat('{"cluster_id":"',activity('Web1').output.cluster_id,'"}'))
    

    Use the above configuration in the web activity with the same PAT token for authentication.


    After the notebook execution, the same cluster will be deleted permanently by the web activity.