Tags: pyspark, databricks, azure-databricks

Is there a way to loop through a complete Databricks notebook (PySpark)?


Let’s take an example. I’m working on a large dataset and want to split my processing into weekly runs. For now, my process is divided into multiple chunks/commands.

My question is: is it possible to loop over the whole notebook, or should I regroup all my code/processing into a single cell?

For instance, working on January 2021, I want the code to run on a weekly basis: give it a starting date, run from that date to day+7, apply all the processing and store the results, update the start variable to day+8, and repeat until it reaches a fixed limit, for instance 31 January.

Is there a way to do this without regrouping all of the code in a single cell, something like a programmatic “run all above / run all below” command?


Solution

  • You can implement this by changing your notebook to accept parameter(s) via widgets, and then triggering that notebook, for example, as a Databricks job or with dbutils.notebook.run from another notebook that implements the loop (doc), passing the necessary dates as parameters.

    This will be:

    • in your original notebook:
    dbutils.widgets.text("starting_date", "2021-01-01")  # declare the widget with a default
    starting_date = dbutils.widgets.get("starting_date")
    # ... your code
    
    • in the calling notebook (60 is the timeout in seconds; it may need to be higher depending on the amount of transformations):
    dbutils.notebook.run("path_to_original_notebook", 60,
       {"starting_date": "2021-01-01"})