I am trying to load my datalake SCD2 tables via a Databricks DLT pipeline. My goal is to load the tables dynamically in a notebook by passing the required information in the "Configuration" setting of the DLT pipeline.
The "Configuration" setting could look like this:
{
  "object1": {
    "object_name": "table",
    "business_key_column": "table_bk",
    "except_column_list": "change_date",
    "sequence_column": "change_date"
  },
  "object2"...
}
Later, the target notebook should read these parameters from the pipeline configuration and load the tables accordingly. Can I obtain the pipeline configuration settings in the target notebook? If yes: how?
Yes, you can read the pipeline configuration settings in the notebook through the Spark configuration.
As mentioned in this documentation, the configuration settings are made available to pipeline queries through the Spark configuration.
You can access a value with the following statement:
spark.conf.get("key")
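If the notebook might also be run outside the pipeline, spark.conf.get accepts a default value as a second argument. A small sketch; the key and fallback names below are hypothetical:

# Returns the configured value, or the fallback when "my_key" is not set.
# "my_key" and "fallback_value" are placeholders used only for illustration.
value = spark.conf.get("my_key", "fallback_value")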
For example, suppose a key json_path is defined in the pipeline's Configuration setting. It can then be accessed in the notebook like this:
import dlt

@dlt.table(
    comment="The raw wikipedia clickstream dataset, ingested from /databricks-datasets."
)
def clickstream_raw():
    # Read the source path from the pipeline configuration
    json_path = spark.conf.get("json_path")
    return spark.read.format("json").load(json_path)
The DataFrame is then read from the path supplied in the configuration.
In your case, store the whole JSON object as a string value under a single configuration key (for example object_list) and parse it in the notebook like below.
import dlt
import json

@dlt.table()
def object_list():
    # Parse the JSON string stored in the pipeline configuration
    json_obj = json.loads(spark.conf.get("object_list"))
    # One JSON document per configured object
    data = [json.dumps(json_obj[i]) for i in json_obj]
    json_df = spark.read.json(spark.sparkContext.parallelize(data))
    return json_df
Once you have the parsed JSON object, you can use it to drive the rest of the pipeline.
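Since the goal is to create SCD2 tables dynamically, one option is to loop over the parsed configuration and declare an apply_changes flow per object. This is only a minimal sketch: it assumes each object's change feed is already available in the pipeline as a view or table named after object_name, that a "_scd2" target naming convention suits you, and that except_column_list holds a comma-separated list of column names.

import dlt
import json

object_list = json.loads(spark.conf.get("object_list"))

for obj in object_list.values():
    source_name = obj["object_name"]        # assumed: a view/table with this name exists in the pipeline
    target_name = f"{source_name}_scd2"     # assumed naming convention for the SCD2 target

    # Declare the streaming table that apply_changes will maintain as SCD type 2
    dlt.create_streaming_table(name=target_name)

    dlt.apply_changes(
        target=target_name,
        source=source_name,
        keys=[obj["business_key_column"]],
        sequence_by=obj["sequence_column"],
        except_column_list=obj["except_column_list"].split(","),
        stored_as_scd_type=2,
    )

On older DLT runtimes the target is declared with dlt.create_streaming_live_table instead of dlt.create_streaming_table; everything else stays the same.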