amazon-s3, pyspark, databricks, parquet

pyspark overwrite silently failed to remove stale parquet files


Environment: Databricks Runtime 9.1 LTS ML (includes Apache Spark 3.1.2, Scala 2.12)

I performed the overwrite operation below:

    df.repartition(parts).write.mode('overwrite').parquet(s3_output_path)

You can see in the screenshot below that this process succeeded and wrote a new parquet table. However, it failed to delete all of the stale parquet files from the previous version. I don't understand how this is possible: I thought that Spark, in overwrite mode, does not start writing new parquet files until all of the stale parquet files have been deleted.

I think the presence of this _committed_vacuum* file may be related. However, I cannot find any documentation on the purpose of this file. Note: I am NOT using Delta.

Why was no error raised when stale parquet files failed to be fully deleted?

[Screenshot: S3 output listing showing the newly written parquet files alongside stale parquet files from the previous version, plus _SUCCESS, _committed_* and _committed_vacuum* marker files]
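For reference, here is a minimal sketch of how such a listing can be produced programmatically, separating the DBIO marker files from the parquet part files. The bucket and prefix values are hypothetical placeholders for s3_output_path, and the sketch assumes boto3 with valid AWS credentials:

    # List the output prefix and split DBIO marker files from parquet part files.
    # Bucket and prefix are hypothetical placeholders for s3_output_path.
    import boto3

    bucket = "my-bucket"
    prefix = "path/to/output/"

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    markers, parts = [], []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            name = obj["Key"].rsplit("/", 1)[-1]
            if name.startswith("_"):        # _SUCCESS, _started_*, _committed_*, _committed_vacuum*
                markers.append(name)
            elif name.endswith(".parquet"):
                parts.append((obj["LastModified"], name))

    # Sorting the data files by timestamp makes older (stale) parts stand out.
    for ts, name in sorted(parts):
        print(ts, name)
    print("marker files:", markers)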


Solution

  • It is possible for stale parquet files to remain after a successful overwrite, depending on the commit protocol being used under the hood. In this case, the Databricks DBIO commit protocol was in use.

    Atomic file overwrite - It is sometimes useful to atomically overwrite a set of existing files. Today, Spark implements overwrite by first deleting the dataset, then executing the job producing the new data. This interrupts all current readers and is not fault-tolerant. With transactional commit, it is possible to "logically delete" files atomically by marking them as deleted at commit time. Atomic overwrite can be toggled by setting "spark.databricks.io.directoryCommit.enableLogicalDelete true|false". This improves user experience across those that are accessing the same datasets at the same time. https://www.databricks.com/blog/2017/05/31/transactional-writes-cloud-storage.html

    It is clear that the DBIO protocol was being used, given the presence of _* files such as _SUCCESS and _committed*.

    Also, spark.sql.sources.commitProtocolClass was set to com.databricks.sql.transaction.directory.DirectoryAtomicCommitProtocol (one way to inspect these settings is sketched at the end of this answer).

    According to Databricks documentation,

    On non-Delta tables, Databricks automatically triggers VACUUM operations as data is written. https://docs.databricks.com/en/sql/language-manual/delta-vacuum.html#vacuum-a-non-delta-table

    So it is possible, though presumably rare, for stale parquet files to remain even after a successful DBIO overwrite: the old files are only logically deleted at commit time and are physically removed by a later automatic VACUUM. Exactly how likely this is remains unclear, given the sparse documentation of non-Delta vacuuming; running VACUUM manually on the output path, as sketched below, is one way to clean up the leftovers.
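    A minimal sketch of how these settings can be inspected and how a manual VACUUM could be issued against the non-Delta output path, assuming a Databricks cluster; the path is a hypothetical placeholder, and the exact VACUUM syntax and retention behaviour for non-Delta paths should be verified against the linked Databricks documentation:

        # Minimal sketch for a Databricks cluster: inspect the commit-protocol
        # settings, then (optionally) vacuum the non-Delta output path by hand.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()  # returns the existing session on Databricks

        s3_output_path = "s3://my-bucket/path/to/output"  # hypothetical placeholder

        # Which commit protocol is active? With DBIO this is typically
        # com.databricks.sql.transaction.directory.DirectoryAtomicCommitProtocol.
        print(spark.conf.get("spark.sql.sources.commitProtocolClass", "<default>"))

        # Is atomic overwrite via logical delete enabled?
        print(spark.conf.get("spark.databricks.io.directoryCommit.enableLogicalDelete", "<not set>"))

        # Per the linked docs, Databricks also supports VACUUM on non-Delta tables/paths,
        # which removes uncommitted and logically deleted files older than the retention
        # threshold. (The path form and retention shown here are assumptions to verify.)
        spark.sql(f"VACUUM '{s3_output_path}' RETAIN 168 HOURS")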

    Other Sources: