jupyter-notebook, azure-data-factory, azure-synapse

(Azure Data Factory Pipeline) For each Notebook in a specific folder


How can I create a ForEach activity that:

  1. Gets the list of all notebooks that exist in a specific workspace folder in Databricks.
  2. Executes each notebook.

Currently I do this by adding a Notebook activity for each notebook and connecting them one after another. That approach is not efficient, because whenever a new notebook is created in Databricks I have to remember to update my pipeline in Azure Synapse / Data Factory.

Thanks.


Solution

  • You can use the Workspace List REST API (GET /api/2.0/workspace/list) to get the list of notebooks in a workspace folder. I first tried to get the list with a Web activity but was not able to make it work, so I used a Databricks notebook (start_notebook) to get the list of notebooks and then filtered it down to the required ones.

    start_notebook code:

    import requests
    import json

    # Folder whose notebooks should be listed (replace with your own workspace path)
    my_json = {"path": "/Users/< yours@mail.com >/Folder/"}

    # Databricks personal access token for authentication
    auth = {"Authorization": "Bearer <Access token>"}

    # Call the Workspace List API; the JSON response contains an "objects" array
    response = requests.get('https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/workspace/list', json = my_json, headers=auth).json()

    # Return the listing to the calling ADF pipeline as this notebook's exit value
    dbutils.notebook.exit(response)
    

    In the ADF Notebook activity output you can then get the list of notebooks as a JSON array, like below.

    [Screenshot: Notebook activity output showing the runOutput objects array]
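
    For reference, the workspace/list response (and therefore the runOutput) has roughly this shape; the paths and IDs below are only illustrative:

    {
        "objects": [
            {"object_type": "NOTEBOOK", "object_id": 1111, "language": "PYTHON", "path": "/Users/yours@mail.com/Folder/start_notebook"},
            {"object_type": "NOTEBOOK", "object_id": 2222, "language": "PYTHON", "path": "/Users/yours@mail.com/Folder/notebook_a"},
            {"object_type": "NOTEBOOK", "object_id": 3333, "language": "PYTHON", "path": "/Users/yours@mail.com/Folder/notebook_b"}
        ]
    }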

    Now use a Filter activity to remove start_notebook itself from the above array, so the ForEach does not run it again.

    I have used a pipeline parameter (start) to hold the name of the notebook to exclude.

    [Screenshot: pipeline parameter named start]

    Filter activity:

    [Screenshot: Filter activity settings]

    Items: @activity('Notebook1').output.runOutput.objects
    Condition: @not(equals(last(split(string(item().path),'/')), pipeline().parameters.start))
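
    The condition keeps every item whose last path segment differs from the start parameter. The equivalent logic in plain Python, purely for illustration (the path below is made up):

    # Illustrative only: what the Filter condition computes for each item
    path = "/Users/yours@mail.com/Folder/start_notebook"   # item().path
    name = path.split("/")[-1]                             # last(split(string(item().path), '/'))
    keep = name != "start_notebook"                        # not(equals(name, pipeline().parameters.start))
    print(keep)                                            # False, so this item is filtered out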

    Filter output array:

    [Screenshot: Filter activity output array]

    Pass this output array to a ForEach activity as @activity('Filter1').output.value, and inside the ForEach use a Notebook activity with @item().path as the notebook path; a rough sketch of the resulting pipeline JSON is shown after the screenshot below.

    [Screenshot: ForEach activity with a Notebook activity inside]
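
    For orientation only, the ForEach part of the pipeline JSON looks roughly like the sketch below; the activity names and the Databricks linked service reference are placeholders that must match your own pipeline:

    {
        "name": "ForEach1",
        "type": "ForEach",
        "typeProperties": {
            "items": {
                "value": "@activity('Filter1').output.value",
                "type": "Expression"
            },
            "activities": [
                {
                    "name": "RunNotebook",
                    "type": "DatabricksNotebook",
                    "typeProperties": {
                        "notebookPath": {
                            "value": "@item().path",
                            "type": "Expression"
                        }
                    },
                    "linkedServiceName": {
                        "referenceName": "AzureDatabricksLinkedService",
                        "type": "LinkedServiceReference"
                    }
                }
            ]
        }
    }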