
Run both Databricks Optimize and Vacuum?


Does it make sense to call BOTH Databricks (Delta) Optimize and Vacuum? It SEEMS like it makes sense but I don't want to just infer what to do. I want to ask.

Vacuum

Recursively vacuum directories associated with the Delta table and remove data files that are no longer in the latest state of the transaction log for the table and are older than a retention threshold. Files are deleted according to the time they have been logically removed from Delta’s transaction log + retention hours, not their modification timestamps on the storage system. The default threshold is 7 days.

Optimize

Optimizes the layout of Delta Lake data. Optionally optimize a subset of data or colocate data by column. If you do not specify colocation, bin-packing optimization is performed.

Second question: if the answer is yes, which is the best order of operations?

  1. Optimize then Vacuum
  2. Vacuum then Optimize

Solution

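  • Yes, you need to run both. OPTIMIZE rewrites small files into larger ones, but it only marks the old files as removed in the transaction log; VACUUM is what actually deletes them from storage. With default settings the order doesn't matter, because VACUUM only deletes files that were removed from the log more than 7 days ago, so files just replaced by OPTIMIZE survive either way. Order matters only if you run VACUUM with a retention of 0 hours, which isn't recommended anyway: it deletes every file not in the current table version, so you lose the ability to time travel to earlier versions.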
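
As a rough illustration of running both commands in one maintenance job, here is a minimal sketch. The helper name `optimize_then_vacuum` and the table name in the comment are hypothetical; the helper only assumes it is handed an object with a `sql` method, such as the `spark` session that Databricks notebooks provide. The default of 168 hours matches the 7-day retention threshold mentioned above.

```python
def optimize_then_vacuum(spark, table_name, retention_hours=168):
    """Compact a Delta table, then remove unreferenced files past the retention window.

    `spark` is assumed to be a SparkSession (or anything with a .sql(str) method).
    `table_name` is inserted into the statement as-is, so pass only trusted names.
    """
    statements = [
        # Bin-packs small files into larger ones; the replaced files are only
        # marked as removed in the transaction log at this point.
        f"OPTIMIZE {table_name}",
        # Deletes files that were removed from the log longer ago than the
        # retention window (168 hours = the 7-day default).
        f"VACUUM {table_name} RETAIN {retention_hours} HOURS",
    ]
    for statement in statements:
        spark.sql(statement)
    return statements


# Example call in a Databricks notebook (table name is made up):
# optimize_then_vacuum(spark, "my_schema.events")
```

With the default retention, the files replaced by this OPTIMIZE run are not deleted by the VACUUM that follows it; a later VACUUM run removes them once they are more than 7 days past their removal from the log.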