pyspark · delta-lake

How to wait for df.write() to complete


In my pyspark notebook, I

  1. read from tables to create data frames
  2. aggregate the data frame
  3. write to a folder
  4. create a SQL table from the output folder

For

#1. I do `spark.read.format("delta").load(path)`
#2. I do `df = df.filter(...).groupBy(...).agg(...)`
#3. I do `df.write.format("delta").mode("append").save(output_folder)`
#4. I do `spark.sql(f"CREATE TABLE IF NOT EXISTS {sql_database}.{table_name} USING delta LOCATION '{path to output folder in #3}'") `
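Put together, the whole flow looks roughly like this (the paths, column names, and aggregation below are simplified placeholders for my real ones):

    from pyspark.sql import functions as F

    input_path = "/mnt/data/source_table"   # placeholder input location
    output_folder = "/mnt/data/aggregated"  # placeholder output location
    sql_database = "analytics"              # placeholder database name
    table_name = "daily_totals"             # placeholder table name

    # 1. read the Delta table into a data frame
    df = spark.read.format("delta").load(input_path)

    # 2. aggregate (example filter/grouping, not my real columns)
    df = (df.filter(F.col("status") == "active")
            .groupBy("customer_id")
            .agg(F.sum("amount").alias("total")))

    # 3. write the result to the output folder
    df.write.format("delta").mode("append").save(output_folder)

    # 4. register a SQL table over the output folder
    spark.sql(f"CREATE TABLE IF NOT EXISTS {sql_database}.{table_name} "
              f"USING delta LOCATION '{output_folder}'")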

The issue I am having: I have debug print statements before and after steps 3 and 4, I have checked that there are parquet files in my output folder, and the path I use when creating the SQL table is correct. There is also no exception in the console.

But when I look for that newly created table in my SQL tool, I can't see it.

Can you please tell me if I need to wait for the write to be done before I create the SQL table? If yes, what do I need to do to wait for the write to finish?


Solution

  • By default, .write is a synchronous operation, so step 4 will be executed only after step 3 is done. But really, it's easier to write and create the table in one step by using .saveAsTable together with the path option:

    df = spark.read.format("delta").load(path)
    df = df.filter(...).groupBy(...).agg(...)
    # one synchronous call that writes the data and registers the table
    df.write.format("delta").mode("append") \
      .option("path", output_folder) \
      .saveAsTable(f"{sql_database}.{table_name}")
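If you still want to confirm that the table actually landed in the metastore before querying it from your SQL tool, you can check the catalog right after the write. A quick sketch, reusing the names from above (spark.catalog.tableExists needs PySpark 3.3+; the SQL checks work on older versions too):

    # sanity checks after the write
    spark.sql(f"SHOW TABLES IN {sql_database}").show()
    spark.sql(f"SELECT COUNT(*) FROM {sql_database}.{table_name}").show()

    # on PySpark 3.3+ you can also ask the catalog directly
    assert spark.catalog.tableExists(f"{sql_database}.{table_name}")

Either way, no extra waiting is needed: by the time .save or .saveAsTable returns, the data has been committed.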