azure-data-factory azure-databricks

Running multiple databricks notebooks concurrently


I have an Azure Data Factory pipeline that triggers a Databricks notebook. Inside this notebook, I have the following code to unmount and mount storage:

# Unmount and mount storage
mnt_point = "/mnt"
out_mnt_point = "/out_mnt"

# Unmount storage, if any
for mount in dbutils.fs.mounts():
    if (mount.mountPoint == mnt_point):
        dbutils.fs.unmount(mnt_point)
    elif (mount.mountPoint == out_mnt_point):
        dbutils.fs.unmount(out_mnt_point)

# Mount storage for input
dbutils.fs.mount(
    source = f"wasbs://" + input_folder + "@xxx.blob.core.windows.net",
    mount_point = mnt_point,
    extra_configs = {f"fs.azure.account.key.xxx.blob.core.windows.net": azure_account_key }
)

# Mount storage for output
dbutils.fs.mount(
    source = f"wasbs://" + output_folder + "@xxx.blob.core.windows.net",
    mount_point = out_mnt_point,
    extra_configs = {f"fs.azure.account.key.xxx.blob.core.windows.net": azure_account_key }
)

My question is: if multiple instances of the pipeline run concurrently, will they affect each other (e.g. one notebook is mounting while another is unmounting, making the other process fail)? Or does each instance have its own isolated resources?


Solution

    • When Azure Data Factory runs the same notebook several times at once to unmount and mount containers, each execution is a separate notebook run.
    • However, since the code in every run works on the same resources (the same mount points and blob storage containers), running the notebooks simultaneously causes errors.
    • To demonstrate, I used a ForEach activity with batch execution to run the Databricks Notebook activity several times simultaneously.

    • When this is executed, all of the activities except one fail.

    • Opening the failed runs shows the error message: Directory already mounted: /mnt/ip.

    • This happens because the simultaneous runs interleave: one run mounts /mnt/ip after another run has already passed its unmount step but before that run reaches its own mount call, so the later mount finds the directory already mounted. With simultaneous execution there is therefore a high chance that a notebook run fails.
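
The interleaving described above can be reproduced without a cluster. The sketch below is not Databricks code: `FakeMountTable` is a hypothetical in-memory stand-in for the mount table, and the container names are made up. It assumes, as the error message suggests, that all notebook runs see the same mount table.

```python
class FakeMountTable:
    """Hypothetical in-memory stand-in for a mount table shared by all runs."""

    def __init__(self):
        self.mounts = {}  # mount point -> source

    def mount(self, source, mount_point):
        # Mirror the observed behaviour: mounting an existing mount point fails.
        if mount_point in self.mounts:
            raise RuntimeError(f"Directory already mounted: {mount_point}")
        self.mounts[mount_point] = source

    def unmount(self, mount_point):
        self.mounts.pop(mount_point, None)


def unmount_if_present(table, mount_point):
    # First half of the notebook: unmount only if the mount point exists.
    if mount_point in table.mounts:
        table.unmount(mount_point)


table = FakeMountTable()
mount_point = "/mnt/ip"

# Both runs pass their unmount step while nothing is mounted yet.
unmount_if_present(table, mount_point)  # run A
unmount_if_present(table, mount_point)  # run B

# Run A mounts first and succeeds.
table.mount("wasbs://container-a@xxx.blob.core.windows.net", mount_point)

# Run B now reaches its mount call and fails.
try:
    table.mount("wasbs://container-b@xxx.blob.core.windows.net", mount_point)
    error = None
except RuntimeError as exc:
    error = str(exc)

print(error)  # prints: Directory already mounted: /mnt/ip
```

Which run wins depends only on timing, which would explain why all but one of the parallel activities failed.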

    UPDATE:

    Try the following code, which never unmounts and mounts a container only if its mount point does not already exist:

    # Mount storage only if it is not already mounted
    azure_account_key = "<storage-account-key>"  # placeholder; do not hard-code a real key
    mnt_point = "/mnt/ip"
    out_mnt_point = "/mnt/op"

    # Collect the mount points that already exist
    existing_mounts = {mount.mountPoint for mount in dbutils.fs.mounts()}

    # Mount storage for input, if not already mounted
    if mnt_point not in existing_mounts:
        dbutils.fs.mount(
            source = "wasbs://" + input_folder + "@xxx.blob.core.windows.net",
            mount_point = mnt_point,
            extra_configs = {"fs.azure.account.key.xxx.blob.core.windows.net": azure_account_key}
        )

    # Mount storage for output, if not already mounted
    if out_mnt_point not in existing_mounts:
        dbutils.fs.mount(
            source = "wasbs://" + output_folder + "@xxx.blob.core.windows.net",
            mount_point = out_mnt_point,
            extra_configs = {"fs.azure.account.key.xxx.blob.core.windows.net": azure_account_key}
        )
    
    • Running the notebook with the above code simultaneously did not throw any errors.

    NOTE: The same applies to simultaneous pipeline runs, not only to the simultaneous activity runs demonstrated above.
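
A check followed by a mount is still not atomic, so two runs could in principle both see the mount point as missing and both try to mount it. A minimal sketch of a more defensive variant follows. `mount_if_absent` is a hypothetical helper, not part of `dbutils`, and it assumes that a lost race surfaces as an exception whose text contains "Directory already mounted", as in the failed runs above.

```python
def mount_if_absent(fs, source, mount_point, extra_configs):
    """Mount `source` at `mount_point` unless it is already mounted.

    `fs` is expected to behave like `dbutils.fs`. Returns True if this call
    performed the mount and False if the mount point already existed.
    """
    # Fast path: skip the mount call if the mount point is already listed.
    if any(m.mountPoint == mount_point for m in fs.mounts()):
        return False
    try:
        fs.mount(source=source, mount_point=mount_point, extra_configs=extra_configs)
        return True
    except Exception as exc:
        # Assumption: another run mounted the same point between the check
        # and the mount call, and the error text says so.
        if "Directory already mounted" in str(exc):
            return False
        raise  # any other failure is a real error
```

In the notebook this would be called as `mount_if_absent(dbutils.fs, "wasbs://" + input_folder + "@xxx.blob.core.windows.net", mnt_point, {"fs.azure.account.key.xxx.blob.core.windows.net": azure_account_key})`. Like the code above, the helper leaves an existing mount untouched, so it only suits runs that all expect the same container at a given mount point.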