Tags: apache-spark, pyspark, databricks, spark-structured-streaming

PySpark: wait for writeStream to finish in a notebook (Databricks)


Currently, I'm having an issue where a Spark dataframe (Auto Loader) in one cell may take a few moments to write its data, and the code in the following cell references the table written by the first. If the entire notebook is run (particularly as a Job), the second cell starts before the first has fully completed, because writeStream kicks off an asynchronous streaming query. How can I have the second cell await the finish of the writeStream without putting them in separate notebooks?

Example:

Cell 1

autoload = pysparkDF.writeStream.format('delta')....toTable('TABLE1')

Cell 2

df = spark.sql('select count(*) from TABLE1')

Solution

  • You need to use the awaitTermination function to wait until stream processing has finished (see the Structured Streaming docs). Like this:

    • cell 1
    autoload = pysparkDF.writeStream.format('delta')....toTable('TABLE1')
    autoload.awaitTermination()
    
    • cell 2
    df = spark.sql('select count(*) from TABLE1')
    

    although it would be easier to read, and harder to make a mistake, with something like this:

    cnt = spark.read.table('TABLE1').count()
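
    One caveat: on a continuously running stream, awaitTermination() blocks until the query is stopped or fails, so the next cell would never run. For a job-style run you usually combine it with a one-shot trigger, so the stream processes everything that is available and then stops. A minimal sketch, assuming Spark 3.3+ (for availableNow) and a placeholder checkpoint path:

    autoload = (pysparkDF.writeStream
        .format('delta')
        .option('checkpointLocation', '/tmp/checkpoints/table1')  # placeholder path
        .trigger(availableNow=True)  # process all available data, then stop
        .toTable('TABLE1'))
    autoload.awaitTermination()  # returns once the backlog is processed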
    

    Update: To wait for multiple streams:

    while len(spark.streams.active) > 0:
        # otherwise awaitAnyTermination() would return immediately
        # after the first stream had terminated
        spark.streams.resetTerminated()
        spark.streams.awaitAnyTermination()
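
    For example, if a notebook starts several streams, the loop above waits for all of them to terminate. A sketch with hypothetical streaming dataframes df1/df2 and placeholder checkpoint paths:

    q1 = (df1.writeStream.format('delta')
          .option('checkpointLocation', '/tmp/cp/t1')  # placeholder
          .trigger(availableNow=True)
          .toTable('TABLE1'))
    q2 = (df2.writeStream.format('delta')
          .option('checkpointLocation', '/tmp/cp/t2')  # placeholder
          .trigger(availableNow=True)
          .toTable('TABLE2'))

    while len(spark.streams.active) > 0:
        spark.streams.resetTerminated()
        spark.streams.awaitAnyTermination()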