Search code examples
azuredatabricks

Changing Databricks Liquid Clustering Keys


I'm working with Liquid Clustering in Databricks and have a question regarding what happens when the clustering key is changed.

The Databricks documentation states that after modifying the clustering key and running OPTIMIZE, the new clustering will be applied only to new data. However, it does not specify what happens to the historical data that was previously written with the old clustering key.

Does the historical data remain in its original layout? Is there a way to reorganize the old data to follow the new clustering key? If anyone has insights or relevant links on this topic, I’d really appreciate it.

Thanks!

I have already reviewed the available documentation but couldn't find a clear answer.


Solution

  • You need to use the following SQL command (only available from DBR 16.0 onwards.

    OPTIMIZE table_name FULL;
    

    You can refer to the documentation here which states

    Run OPTIMIZE FULL when you enable clustering for the first time or change clustering keys. If you have previously run OPTIMIZE FULL and there has been no change to clustering keys, OPTIMIZE FULL runs the same as OPTIMIZE. Always use OPTIMIZE FULL to ensure that data layout reflects the current clustering keys