Currently, I'm having an issue with a Spark DataFrame (Auto Loader) in one cell that may take a few moments to write its data. The following cell then references the table written by the first one. However, when the entire notebook is run (particularly as a Job), the second cell starts before the first cell's stream has fully completed, because writeStream returns immediately and the query runs asynchronously. How can I have the second cell await the finish of the writeStream without putting the cells in separate notebooks?
Example:
Cell1
autoload = pysparkDF.writeStream.format('delta')....toTable('TABLE1')
Cell2
df = spark.sql('select count(*) from TABLE1')
You need to use the awaitTermination method on the returned StreamingQuery to block until the stream processing is finished (see the docs). Like this:
autoload = pysparkDF.writeStream.format('delta')....toTable('TABLE1')
autoload.awaitTermination()  # blocks until the streaming query stops
df = spark.sql('select count(*) from TABLE1')
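Note that awaitTermination() only returns once the query actually stops, so this pattern assumes a finite trigger such as .trigger(availableNow=True) (or the older .trigger(once=True)); with a continuously running stream it would block forever. Here is a fuller sketch, where the source path, checkpoint location, and file format are illustrative assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # in a Databricks notebook, 'spark' already exists

query = (spark.readStream
    .format('cloudFiles')                                # Auto Loader source
    .option('cloudFiles.format', 'json')                 # assumed input format
    .option('cloudFiles.schemaLocation', '/tmp/schema')  # hypothetical path
    .load('/tmp/input')                                  # hypothetical path
    .writeStream
    .format('delta')
    .option('checkpointLocation', '/tmp/checkpoint')     # hypothetical path
    .trigger(availableNow=True)  # process the backlog, then stop
    .toTable('TABLE1'))

query.awaitTermination()  # returns once the backlog has been written

# The next cell can now safely read the fully written table.
print(spark.read.table('TABLE1').count())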
although it reads more easily, and is harder to get wrong, written like this (the batch read only runs after awaitTermination() has returned, so it sees the completed table):
count = spark.read.table('TABLE1').count()
Update: To wait for multiple streams:
while len(spark.streams.active) > 0:
    # Otherwise awaitAnyTermination() would return immediately after the
    # first stream has terminated.
    spark.streams.resetTerminated()
    spark.streams.awaitAnyTermination()
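For example, here is a sketch that starts two availableNow streams and then blocks until both have finished; the source paths, table names, and file format are hypothetical:

# Start one Auto Loader stream per (source, target) pair; hypothetical names.
for src, dest in [('/tmp/in1', 'TABLE1'), ('/tmp/in2', 'TABLE2')]:
    (spark.readStream
        .format('cloudFiles')
        .option('cloudFiles.format', 'json')
        .option('cloudFiles.schemaLocation', f'/tmp/schema/{dest}')
        .load(src)
        .writeStream
        .format('delta')
        .option('checkpointLocation', f'/tmp/checkpoint/{dest}')
        .trigger(availableNow=True)  # finite trigger, so the streams terminate
        .toTable(dest))

# Block until every active stream on this session has terminated.
while len(spark.streams.active) > 0:
    spark.streams.resetTerminated()
    spark.streams.awaitAnyTermination()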