Tags: apache-spark, pyspark, databricks, spark-structured-streaming

PySpark: wait for writeStream to finish in a notebook (Databricks)


Currently, I'm having an issue where a Spark dataframe (Auto Loader) in one cell may take a few moments to write its data, and the code in the following cell references the table written by the first. If the entire notebook is run (particularly as a Job), the second cell starts before the first has fully completed, because writeStream kicks off an asynchronous streaming query. How can I have the second cell await the finish of the writeStream without putting them in separate notebooks?

Example:

Cell 1

autoload = pysparkDF.writeStream.format('delta')....toTable('TABLE1')

Cell 2

df = spark.sql('select count(*) from TABLE1')

Solution

  • You need to use the awaitTermination function to wait until stream processing has finished (see the Structured Streaming docs). Like this:

    • cell 1
    autoload = pysparkDF.writeStream.format('delta')....toTable('TABLE1')
    autoload.awaitTermination()
    
    • cell 2
    df = spark.sql('select count(*) from TABLE1')
    

    although it would be easier to read, and harder to make a mistake, with something like this:

    cnt = spark.read.table('TABLE1').count()
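
    One caveat: on a continuously running stream, awaitTermination() blocks until the query is stopped or fails, so the next cell would never run. For a job-style run you usually combine it with a one-shot trigger, so the stream processes everything that is available and then stops. A minimal sketch, assuming Spark 3.3+ (for availableNow) and a placeholder checkpoint path:

    autoload = (pysparkDF.writeStream
        .format('delta')
        .option('checkpointLocation', '/tmp/checkpoints/table1')  # placeholder path
        .trigger(availableNow=True)  # process all available data, then stop
        .toTable('TABLE1'))
    autoload.awaitTermination()  # returns once the backlog is processed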
    

    Update: To wait for multiple streams:

    while len(spark.streams.active) > 0:
        # otherwise awaitAnyTermination() would return immediately
        # after the first stream had terminated
        spark.streams.resetTerminated()
        spark.streams.awaitAnyTermination()
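
    For example, if a notebook starts several streams, the loop above waits for all of them to terminate. A sketch with hypothetical streaming dataframes df1/df2 and placeholder checkpoint paths:

    q1 = (df1.writeStream.format('delta')
          .option('checkpointLocation', '/tmp/cp/t1')  # placeholder
          .trigger(availableNow=True)
          .toTable('TABLE1'))
    q2 = (df2.writeStream.format('delta')
          .option('checkpointLocation', '/tmp/cp/t2')  # placeholder
          .trigger(availableNow=True)
          .toTable('TABLE2'))

    while len(spark.streams.active) > 0:
        spark.streams.resetTerminated()
        spark.streams.awaitAnyTermination()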