azure-synapse · archive · delta-lake

Archiving/Purging in Delta Tables in Azure Synapse (without using Databricks)


I am working on a flexible, parameterized framework/pipeline that archives/purges data using Azure Synapse. The current setup includes a few Delta tables that cannot be handled like normal blobs or tables. I have read the Microsoft documentation on setting the retention period in Databricks, where I can also VACUUM the unused data files. Is there an alternative way to achieve the same thing in Azure Synapse?

I have a framework blueprint in place, but it was created with non-Delta data files in mind.


Solution

  • Thank you @Tyler Long for the excellent documentation on Data Lake in Azure Synapse Analytics.

    As you mentioned, you want an approach for the retention period and for VACUUMing unused data files in Azure Synapse.

    • If table operations are performed frequently, a Delta table can accumulate many small files in the data lake storage.
    • Read times for queries against the Delta table can suffer, because many files must be scanned and read whenever a query is performed.
    • Compaction addresses this: read the table into a data frame and write the data frame back out as a single file.

    Since you mentioned that you don't want to use Databricks, I used an Azure Synapse Spark notebook to perform the retention-period configuration and the VACUUM.

    The statement below reads the current version of the Delta table and writes it back out as a single file.

    (spark.read
        .format("delta")
        .load("/Delta_Demo/Employees/")    # read the current table version
        .repartition(1)                    # compact into a single partition/file
        .write
        .option("dataChange", "false")     # mark as a rearrangement, not a data change
        .format("delta")
        .mode("overwrite")
        .save("/Delta_Demo/Employees/"))
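The retention windows themselves can also be tuned per table through Delta table properties. The sketch below only builds the `ALTER TABLE ... SET TBLPROPERTIES` statement as a string; the property names (`delta.logRetentionDuration`, `delta.deletedFileRetentionDuration`) come from the Delta Lake documentation, while the helper name and default day counts are assumptions for illustration:

```python
def retention_props_stmt(table, log_days=30, deleted_file_days=7):
    """Build an ALTER TABLE statement that sets Delta retention windows.

    log_days controls how long table history (log entries) is kept;
    deleted_file_days controls how long unreferenced data files survive
    before VACUUM is allowed to remove them.
    """
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'delta.logRetentionDuration' = 'interval {log_days} days', "
        f"'delta.deletedFileRetentionDuration' = 'interval {deleted_file_days} days')"
    )

# On a Synapse Spark pool you would then run, for example:
#   spark.sql(retention_props_stmt("default.DELTA_Employees"))
```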
    


    • Delta Lake does not delete old Parquet files when change operations occur; the older files are still referenced by previous table versions for a default retention period of 30 days.
    • Even after the default retention period has passed, the Parquet data files are not deleted automatically, even when they are no longer referenced by the log files.
    • To manage data lake storage costs efficiently, the VACUUM command can be used to remove these old data files.
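Besides the SQL form shown below, the same cleanup can be driven from the Delta Lake Python API (`DeltaTable.forPath(...).vacuum(...)` in the delta-spark package). This is a sketch, not a definitive implementation: the path reuses the earlier example, and the import is deferred into the function since it only resolves on a Spark pool with Delta support.

```python
def days_to_hours(days):
    """VACUUM's retention argument is in hours; convert a day-based policy."""
    return days * 24

def vacuum_delta_table(spark, path="/Delta_Demo/Employees/", retain_days=7):
    """Remove data files no longer referenced and older than the retention window."""
    from delta.tables import DeltaTable  # delta-spark; available on Delta-enabled Spark pools
    DeltaTable.forPath(spark, path).vacuum(days_to_hours(retain_days))
```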


    Vacuum the table with the default retention period (168 hours, i.e. 7 days):

    VACUUM default.DELTA_Employees
    


    Or vacuum while retaining the last 240 hours (10 days) of files:

    VACUUM default.DELTA_Employees RETAIN 240 HOURS
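Note that Delta Lake's safety check rejects a RETAIN window shorter than the deleted-file retention duration (168 hours by default) unless `spark.databricks.delta.retentionDurationCheck.enabled` is set to `false`. The sketch below mirrors that default check while building the VACUUM statement as a string; the helper name is an assumption, and the hardcoded 168-hour floor assumes the table's retention property was left at its default.

```python
def vacuum_stmt(table, retain_hours=None):
    """Build a VACUUM statement; with retain_hours=None the table default applies."""
    if retain_hours is not None and retain_hours < 168:
        # Mirrors Delta's safety check: windows below 168 hours are rejected
        # unless spark.databricks.delta.retentionDurationCheck.enabled is false.
        raise ValueError("retain_hours is below the 168-hour safety check")
    stmt = f"VACUUM {table}"
    if retain_hours is not None:
        stmt += f" RETAIN {retain_hours} HOURS"
    return stmt

# On a Synapse Spark pool you would then run, for example:
#   spark.sql(vacuum_stmt("default.DELTA_Employees", 240))
```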