Search code examples
pysparkdelta-lakedelta

Delete a delta table partition based on the creation/modification date of the partition folder


As I understand, there is a vaccum command in delta to remove the old versions from delta table.

My Usecase: However, I would like to remove the partition completely based on the modification date of the partition.

For example: date=2024-11-22 is my partition name and the modification date of it is 2024-11-30, I want to remove the partition whose modification date are older than 30 days

How to perform this efficiently with the delta features and without losing the durability?


Solution

  • You're mixing up two different things here:

    1. Cleaning up files that no longer belong to the Delta table (this is what VACUUM does).
    2. Deleting data that still exists in the table but is considered outdated.

    In your case, there's no built-in way to remove partitions based on the modification date directly. A good workaround would be to add a new column to your table to track the "update timestamp" (basically your modification date).

    Once you have that, you can simply run a DELETE query like this:

    DELETE FROM your_table
    WHERE update_date < '2024-11-30';
    

    Delta Lake supports standard DML commands like DELETE, so this works natively without any additional setup.

    If you can't add a new column, you'll need to extract some other information to identify the rows you want to remove. If the partition date itself is sufficient, you can use it and perform a simple DELETE.