Tags: databricks, spark-structured-streaming, delta-lake

Is it safe to run VACUUM and DELETE against a Delta Table while there's a Spark Streaming query doing data ingestion?


I've got a 24/7 Spark Structured Streaming query (Kafka as a source) that appends data to a Delta Table.
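For concreteness, a minimal sketch of what such an ingestion query might look like; the broker address, topic name, checkpoint location, and table path are all hypothetical placeholders:

```python
# Hypothetical sketch of the 24/7 append-only ingestion described above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "events")                     # placeholder topic
    .load()
)

query = (
    stream.writeStream
    .format("delta")
    .outputMode("append")  # append-only: never reads existing table state
    .option("checkpointLocation", "s3://bucket/checkpoints/events")
    .start("s3://bucket/tables/events")
)
```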

Is it safe to periodically run VACUUM and DELETE against the same Delta Table from a different cluster while the first one is still processing incoming data?

The table is partitioned by date, and the DELETE will be done at the partition level.

P.S. The infrastructure runs on AWS.
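For reference, a hedged sketch of what the periodic maintenance job could look like when run from the second cluster; the table path and the 30-day cutoff mirror the setup above but are otherwise illustrative:

```python
# Illustrative maintenance job, run from a separate cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-maintenance").getOrCreate()

# Partition-level delete: the predicate targets whole date partitions,
# so no individual files need to be rewritten.
spark.sql("""
    DELETE FROM delta.`s3://bucket/tables/events`
    WHERE date < current_date() - INTERVAL 30 DAYS
""")

# VACUUM removes files no longer referenced by the current table version
# and older than the retention window (168 hours is Delta's 7-day default).
spark.sql("VACUUM delta.`s3://bucket/tables/events` RETAIN 168 HOURS")
```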


Solution

  • If your streaming job is truly append-only, it shouldn't hit any conflicts (see the sketch after this list):

    • A DELETE at the partition level can't conflict under WriteSerializable, the default isolation level, because an append-only write never reads the table before committing, so there is nothing for the DELETE to invalidate.
    • VACUUM only removes files that are no longer referenced in the latest table version (and that are past the retention threshold), so it won't conflict with appends either.
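For completeness, a sketch of how you might pin the isolation level explicitly (`delta.isolationLevel` is a Databricks table property) and inspect the commit history to confirm that the concurrent appends, deletes, and vacuums are committing cleanly; the table path is the same hypothetical one used above:

```python
# Pin the default isolation level explicitly; `delta.isolationLevel`
# is a Databricks table property (values: Serializable, WriteSerializable).
spark.sql("""
    ALTER TABLE delta.`s3://bucket/tables/events`
    SET TBLPROPERTIES ('delta.isolationLevel' = 'WriteSerializable')
""")

# DESCRIBE HISTORY shows the interleaved APPEND / DELETE / VACUUM commits;
# an actual conflict would instead surface to the losing writer as a
# concurrent-modification exception.
spark.sql(
    "DESCRIBE HISTORY delta.`s3://bucket/tables/events`"
).select("version", "operation", "timestamp").show()
```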