Tags: databricks, azure-databricks, delta-lake

How can I prevent Delta Lake checkpoints from being removed in Azure Databricks?


I noticed that I have only 2 checkpoint files in a Delta Lake folder. Every 10 commits, a new checkpoint is created and the oldest one is removed.

For instance, this morning I had 2 checkpoints: 340 and 350. I was able to time travel from version 340 to 359.

Now, after a "write" action, I have 2 checkpoints: 350 and 360, and I can only time travel from 350 to 360. What is removing the old checkpoints? How can I prevent that?

I'm using Databricks Runtime 7.3 LTS ML on Azure Databricks.
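
For reference, this is how I check which checkpoints exist. `dbutils` is available in Databricks notebooks, and the path is a placeholder for my actual table location:

    # List the transaction log; checkpoint files are named like
    # 00000000000000000350.checkpoint.parquet
    files = dbutils.fs.ls("/mnt/delta/my_table/_delta_log")
    print(sorted(f.name for f in files if ".checkpoint" in f.name))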


Solution

  • Old checkpoints are cleaned up automatically when a new checkpoint is written, once they are older than delta.checkpointRetentionDuration (2 days by default). If you want to keep your checkpoints for X days, set the property like this:

    spark.sql("""
        ALTER TABLE delta.`path`
        SET TBLPROPERTIES (
            'delta.checkpointRetentionDuration' = 'X days'
        )
    """)
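
    To confirm the property was applied, you can read it back. This is a quick check using DESCRIBE DETAIL, with `path` again standing in for your table location:

        # Table properties, including the new retention setting, appear
        # in the `properties` column of DESCRIBE DETAIL's output.
        spark.sql("DESCRIBE DETAIL delta.`path`").select("properties").show(truncate=False)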