The goal is to be able to use one script to create different reports based on a filter.
I want my Databricks Job Task parameters and Notebook variables to share the same value for filtering purposes.
This is how I declared the widget and stored its value in a variable:
dbutils.widgets.text(name='field', defaultValue='', label='field')
f1 = dbutils.widgets.get('field')
There are two widget methods involved here, .text() and .get(): one creates the widget the first time, and the other grabs its current value.
Here are some sample screenshots from a class I teach.
The .text() method creates the widget and sets its default value. It only has to be executed once and can be commented out afterwards; the widget has to be recreated if you move the code from one Databricks workspace to another.
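Putting the two methods together, the pattern in a notebook that is also called as a job task usually looks something like this (a minimal sketch using the same 'field' widget as above):

# Run once per workspace to create the widget, then comment it out.
# dbutils.widgets.text(name='field', defaultValue='', label='field')

# .get() returns the current value, whether it was typed into the widget
# or supplied as a job/task parameter with the same name ('field').
f1 = dbutils.widgets.get('field')
print(f'filter value: {f1}')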
In this example, I have a notebook that reads a CSV file and performs a full load of a Delta table. The source file, destination path, debug flag, file schema, and partition count are passed as parameters.
I am using the Adventure Works data files.
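The child notebook itself is not shown here, but a minimal sketch of that pattern looks roughly like this; the widget names, mount paths, and values below are illustrative, not the exact ones from my class:

# Parameters arrive either from the widget UI or from the calling task/notebook.
dbutils.widgets.text('source_file', '')
dbutils.widgets.text('destination_path', '')
dbutils.widgets.text('debug_flag', 'false')
dbutils.widgets.text('file_schema', '')
dbutils.widgets.text('partition_count', '8')

src = dbutils.widgets.get('source_file')
dst = dbutils.widgets.get('destination_path')
debug = dbutils.widgets.get('debug_flag').lower() == 'true'
ddl_schema = dbutils.widgets.get('file_schema')          # e.g. 'id INT, name STRING'
partitions = int(dbutils.widgets.get('partition_count'))

# Read the csv file with the supplied schema.
df = (spark.read
        .format('csv')
        .option('header', 'true')
        .schema(ddl_schema)
        .load(src))

if debug:
    print(f'{src}: {df.count()} rows read')

# Full load: overwrite the Delta table at the destination path.
(df.repartition(partitions)
   .write
   .format('delta')
   .mode('overwrite')
   .save(dst))

# Return a value to the caller (picked up by dbutils.notebook.run()).
dbutils.notebook.exit('success')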
The parent notebook calls the child notebook 15 times to load and create Hive tables for the Adventure Works schema.
The same can be done for report file creation. I hope this explains and shows how to use widgets effectively and call a notebook using the dbutils.notebook.run() method.
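A minimal sketch of that parent notebook follows; the child notebook path, table list, mount points, and database name are placeholders, not the exact values from the example:

# Illustrative subset of the 15 Adventure Works tables.
tables = ['accounts', 'customers', 'products']

spark.sql('CREATE DATABASE IF NOT EXISTS adventureworks')

for tbl in tables:
    # dbutils.notebook.run(path, timeout_seconds, arguments) runs the child
    # notebook and blocks until it calls dbutils.notebook.exit().
    result = dbutils.notebook.run(
        './child_full_load_csv_to_delta',
        600,
        {
            'source_file': f'/mnt/raw/adventureworks/{tbl}.csv',
            'destination_path': f'/mnt/delta/adventureworks/{tbl}',
            'debug_flag': 'false',
            'file_schema': '',            # pass the DDL string for this table
            'partition_count': '4'
        })

    # Register the Delta files written by the child as a Hive table.
    spark.sql(f"CREATE TABLE IF NOT EXISTS adventureworks.{tbl} "
              f"USING DELTA LOCATION '/mnt/delta/adventureworks/{tbl}'")

    print(tbl, result)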
You can translate the .run() method calls into tasks in a Databricks job. What I do not like is the interface: there is a lot of clicking and typing compared to cut and paste in a notebook, although you can use the JSON tab on the right instead.
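For reference, a single notebook task in that JSON view looks roughly like the sketch below; the task key, notebook path, cluster reference, and parameter values are placeholders, and base_parameters is what feeds the widgets of the same names in the child notebook:

{
  "task_key": "load_accounts",
  "job_cluster_key": "shared_job_cluster",
  "notebook_task": {
    "notebook_path": "/Repos/demo/child_full_load_csv_to_delta",
    "base_parameters": {
      "source_file": "/mnt/raw/adventureworks/accounts.csv",
      "destination_path": "/mnt/delta/adventureworks/accounts",
      "debug_flag": "false",
      "partition_count": "4"
    }
  }
}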
To be honest, I usually use ADF for scheduling since most of my clients' data is hybrid in nature.
Here is a screenshot of a sample job with a task to load the accounts Hive table.
Last but not least, the job runs to success.