Tags: databricks, azure-databricks, delta-lake

Any reason to set optimizeWrite = 'true' when autoCompact = 'auto' and no partitions are used?


I work with time series data, and Ingestion-Time Clustering (no partitioning) has proven to work well. The Databricks docs state: "Optimized writes are most effective for partitioned tables, as they reduce the number of small files written to each partition."

Without any partitions, and with autoCompact = 'auto', is there any benefit to setting optimizeWrite = 'true'? The smaller files that the executors write to disk will be compacted by autoCompact anyway. Is my understanding correct?
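For context, these settings are typically applied as Delta table properties (or equivalent Spark session configs). A minimal sketch of the setup described above, assuming a hypothetical Delta table named `events` and recent Databricks Runtime property names:

```sql
-- Enable auto compaction only ('auto' lets Databricks tune the target file size).
-- The optimizeWrite property is left commented out; whether it adds value here
-- is exactly what the question asks.
ALTER TABLE events SET TBLPROPERTIES (
  'delta.autoOptimize.autoCompact' = 'auto'
  -- ,'delta.autoOptimize.optimizeWrite' = 'true'
);
```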


Solution

  • Yes, they could be compacted, but you need to take into account this statement from the docs:

    Auto compaction occurs after a write to a table has succeeded and runs synchronously on the cluster that has performed the write.

    So the write time could increase. Plus, the weakest point of both auto compaction and optimized writes is that they don't co-locate the data the way OPTIMIZE ... ZORDER BY does (see the SQL sketch at the end of this answer).

    But right now I would recommend looking into Liquid Clustering (doc, blog post); it could be better from a performance standpoint than both auto compaction and automatic or explicit optimization. A minimal syntax sketch follows below as well.
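
    To illustrate the co-location point: clustering data on a column is normally a separate maintenance step. A minimal sketch, assuming the hypothetical `events` table with a timestamp column `event_ts`:

    ```sql
    -- Rewrites files so rows with similar event_ts values land in the same files,
    -- which neither optimized writes nor auto compaction does on their own.
    OPTIMIZE events
    ZORDER BY (event_ts);
    ```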
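
    And a sketch of the Liquid Clustering alternative mentioned above, again assuming the hypothetical `events` table and `event_ts` column:

    ```sql
    -- Declare clustering keys at table creation; OPTIMIZE then clusters
    -- new data incrementally, without an explicit ZORDER BY step.
    CREATE TABLE events (
      event_ts  TIMESTAMP,
      device_id STRING,
      value     DOUBLE
    )
    CLUSTER BY (event_ts);

    -- Trigger incremental clustering of newly written data.
    OPTIMIZE events;
    ```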