I am trying to delete an existing Parquet file and replace it with data in a dataframe that read the data in the original Parquet file before deleting it. This is in Azure Synapse using PySpark.
So I created the Parquet file from a dataframe and put it in the path:
full_file_path
I am trying to update this Parquet file. From what I am reading, you can't edit a Parquet file so as a workaround, I am reading the file into a new dataframe:
df = spark.read.parquet(full_file_path)
I then create a new dataframe with the update:
df.createOrReplaceTempView("temp_table")
df_variance = spark.sql("""SELECT * FROM temp_table WHERE ....""")
and the df_variance dataframe is created.
I then delete the original file with:
mssparkutils.fs.rm(full_file_path, True)
and the original file is deleted. But when I do any operation with the df_variance dataframe, like df_variance.count(), I get a FileNotFoundException error. What I am really trying to do is:
df_variance.write.parquet(full_file_path)
and that is also a FileNotFoundException error. But I am finding that any operation I try to do with the df_variance dataframe is producing this error. So I am thinking it might have to do with the fact that the original full_file_path has been deleted and that the df_variance dataframe maintains some sort of reference to the (now deleted) file path, or something like that. Please help. Thanks.
Spark dataframes aren't collections of rows. Spark dataframes use "deferred execution". Only when you call
df_variance.write
is a spark job run that reads from the source, performs your transformations, and writes to the destination.
A Spark dataframe is really just a query that you can compose with other expressions before finally running it.
You might want to move on from parquet to delta. https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-what-is-delta-lake