I am trying to load my datalake SCD2 tables via a Databricks DLT pipeline. My goal is to load the tables dynamically in a notebook by passing the required information in the "Configuration" setting of the DLT pipeline.
The "Configuration" setting could look like this:
{
  "object1": {
    "object_name": "table",
    "business_key_column": "table_bk",
    "except_column_list": "change_date",
    "sequence_column": "change_date"
  },
  "object2"...
}
Later, the target notebook should read these parameters from the pipeline configuration and load the tables accordingly. Can I obtain the pipeline configuration settings in the target notebook? If yes: how?
Yes, you can read the pipeline configuration settings in the notebook through the Spark configuration.
As mentioned in this documentation, the configuration settings are made available to pipeline queries through the Spark configuration.
You can access a value with the following statement:
spark.conf.get("key")
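If the notebook might also be run outside the pipeline, spark.conf.get accepts a default value as a second argument. A small sketch; the key and fallback names below are hypothetical:

# Returns the configured value, or the fallback when "my_key" is not set.
# "my_key" and "fallback_value" are placeholders used only for illustration.
value = spark.conf.get("my_key", "fallback_value")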
For example, suppose a key json_path is defined in the pipeline's Configuration setting. It can then be accessed in the notebook like this:
import dlt

@dlt.table(
    comment="The raw wikipedia clickstream dataset, ingested from /databricks-datasets."
)
def clickstream_raw():
    # Read the source path from the pipeline configuration
    json_path = spark.conf.get("json_path")
    return spark.read.format("json").load(json_path)
The DataFrame is then read from the path supplied in the configuration.
In your case, store the whole JSON object as a string value under a single configuration key (for example object_list) and parse it in the notebook like below.
import dlt
import json

@dlt.table()
def object_list():
    # Parse the JSON string stored in the pipeline configuration
    json_obj = json.loads(spark.conf.get("object_list"))
    # One JSON document per configured object
    data = [json.dumps(json_obj[i]) for i in json_obj]
    json_df = spark.read.json(spark.sparkContext.parallelize(data))
    return json_df
Once you have the parsed JSON object, you can use it to drive the rest of the pipeline.
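Since the goal is to create SCD2 tables dynamically, one option is to loop over the parsed configuration and declare an apply_changes flow per object. This is only a minimal sketch: it assumes each object's change feed is already available in the pipeline as a view or table named after object_name, that a "_scd2" target naming convention suits you, and that except_column_list holds a comma-separated list of column names.

import dlt
import json

object_list = json.loads(spark.conf.get("object_list"))

for obj in object_list.values():
    source_name = obj["object_name"]        # assumed: a view/table with this name exists in the pipeline
    target_name = f"{source_name}_scd2"     # assumed naming convention for the SCD2 target

    # Declare the streaming table that apply_changes will maintain as SCD type 2
    dlt.create_streaming_table(name=target_name)

    dlt.apply_changes(
        target=target_name,
        source=source_name,
        keys=[obj["business_key_column"]],
        sequence_by=obj["sequence_column"],
        except_column_list=obj["except_column_list"].split(","),
        stored_as_scd_type=2,
    )

On older DLT runtimes the target is declared with dlt.create_streaming_live_table instead of dlt.create_streaming_table; everything else stays the same.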