Tags: apache-spark, driver, azure-databricks

Spark driver stopped unexpectedly (Databricks)


I have a Python notebook in Azure Databricks which performs a for loop with 137 iterations. For each iteration, it calls another Scala notebook using dbutils.notebook.run. The Scala notebook creates a DataFrame from a query to a MongoDB database. I create a global temporary view in the Scala notebook using df.createOrReplaceGlobalTempView("<<view_name>>") because I need to recover this data from the Python notebook and keep processing it. The code to read from the Python notebook looks like this:

global_temporary_database = spark.conf.get("spark.sql.globalTempDatabase")
for _ in range(137):
    dbutils.notebook.run(path="<<path_to_scala_notebook>>", timeout_seconds=600, arguments=<<current_configuration>>)
    # Recover the data and drop the global temporary view
    df = spark.table(f"{global_temporary_database}.<<view_name>>")
    spark.catalog.dropGlobalTempView("<<view_name>>")
    # Do some processing, like filtering rows and renaming columns

This works for a limited number of iterations. However, when I try to run the whole loop I get the following error:

The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached

I've tried adding time.sleep() between iterations to avoid overloading the cluster, and calling spark.catalog.clearCache() after each iteration, but neither prevented the error.
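Combined, those attempts amounted to adding a couple of cleanup lines at the end of each iteration (the 30-second pause is just the value I tried; the placeholders are the same as above):

```python
import time

    # ...inside the loop, after dropping the view:
    spark.catalog.dropGlobalTempView("<<view_name>>")
    spark.catalog.clearCache()  # attempted: release cached data between iterations
    time.sleep(30)              # attempted: give the driver a pause between runs
```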

Below are the cluster specs:

  • Databricks Runtime Version: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)
  • 2 Workers: 61 GB Memory, 8 Cores
  • 1 Driver: 16 GB Memory, 4 Cores

Unfortunately, I need both notebooks: the Scala notebook uses a Scala-only library to transform the DataFrames, and I then have to hand the results back to the Python notebook, so there's no way to avoid that part.

Any help would be appreciated.


Solution

  • I resolved this by calling my Scala notebook only once and moving all the loop logic into it, which was a bit tedious. But I guess calling a notebook 137 times is not a good idea, as each call produces a lot of overhead.

    The execution time has improved significantly: it now takes around 20 minutes, whereas previously the loop ran for about an hour before the driver stopped without finishing.
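    The restructured Python side ends up as a single call and a single read, assuming the Scala notebook now loops over the 137 configurations itself and unions the results into one combined global temp view (placeholder names as in the question; the combined view name is hypothetical):

    ```python
    # One notebook call instead of 137; the loop now lives in the Scala notebook
    dbutils.notebook.run(path="<<path_to_scala_notebook>>", timeout_seconds=6000, arguments=<<all_configurations>>)

    # One read of the combined result, then drop the view
    global_temporary_database = spark.conf.get("spark.sql.globalTempDatabase")
    df = spark.table(f"{global_temporary_database}.<<combined_view_name>>")
    spark.catalog.dropGlobalTempView("<<combined_view_name>>")
    # Filtering and column renames now happen once, on the combined DataFrame
    ```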