apache-spark, amazon-s3, pyspark, amazon-emr, delta-lake

Delta Lake (OSS) Table on EMR and S3 - Vacuum takes a long time with no jobs


I'm writing a lot of data into Databricks Delta Lake using the open source version, running on AWS EMR with S3 as the storage layer. I'm using EMRFS.

For performance improvements, I'm compacting and vacuuming the table every so often like so:

    from delta.tables import DeltaTable

    # Rewrite the table into num_files files without marking the data as changed
    (spark.read.format("delta").load(s3path)
        .repartition(num_files)
        .write.option("dataChange", "false").format("delta").mode("overwrite").save(s3path))

    # Remove files no longer referenced by the table that are older than 24 hours
    t = DeltaTable.forPath(spark, s3path)
    t.vacuum(24)

It then deletes hundreds of thousands of files from S3. However, the vacuum step takes an extremely long time. During this time the job appears idle, but every ~5-10 minutes a small task runs, which indicates the job is alive and doing something, starting from task 16 onwards.

I've read through this post, Spark: long delay between jobs, which seems to suggest it may be related to Parquet, but I don't see any options on the Delta side to tune any parameters.


Solution

  • I've also observed that the Delta vacuum command is quite slow. The open source developers are probably limited from making AWS-specific optimizations in the repo because the library is cross-platform (it needs to work on all clouds).

    I've noticed that vacuum is even slow locally. You can clone the Delta repo, run the test suite on your local machine, and see for yourself.

    Deleting hundreds of thousands of files stored in S3 is slow, even if you're using the AWS CLI. You should see if you can refactor your compaction operation to create fewer files that need to be vacuumed.

    Suppose your goal is to create 1GB files. Perhaps you have 15,000 one-gig files and 20,000 small files. Right now, your compaction operation is rewriting all of the data (so all 35,000 original files need to be vacuumed post-compaction). Try to refactor your code to only compact the 20,000 small files (so the vacuum operation only needs to delete 20,000 files).
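    A minimal sketch of that idea in PySpark, assuming the table is partitioned by a date column and that you already know which partitions hold the small files (the predicate and num_small_files below are placeholders, not something Delta computes for you):

        # Only rewrite the partitions that contain the small files,
        # assuming the table is partitioned by a "date" column (hypothetical).
        small_files_predicate = "date >= '2020-01-01' AND date < '2020-02-01'"

        (spark.read.format("delta").load(s3path)
            .where(small_files_predicate)                   # read just the affected partitions
            .repartition(num_small_files)                   # target file count for those partitions
            .write.format("delta")
            .option("dataChange", "false")                  # pure compaction, no logical change
            .option("replaceWhere", small_files_predicate)  # overwrite only those partitions
            .mode("overwrite")
            .save(s3path))

    That way vacuum only has to delete the files that belonged to the rewritten partitions rather than the whole table.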

    The real solution is to build a vacuum command that's optimized for AWS. Delta Lake needs to work with all the popular clouds and the local filesystem. It should be pretty easy to make an open source library that reads the transaction log, figures out what files need to be deleted, makes a performant file deletion API call, and then writes out an entry to the transaction log that's Delta compliant. Maybe I'll make that repo ;)
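    To give a sense of what a performant deletion call could look like, here is a rough sketch of just the batched-delete piece using boto3 (the bucket name and key list are placeholders, and it deliberately skips the hard parts: reading the Delta transaction log and writing a compliant commit entry):

        import boto3

        s3 = boto3.client("s3")
        bucket = "my-bucket"   # placeholder bucket name
        keys_to_delete = []    # fill with the object keys vacuum decided are safe to remove

        # delete_objects removes up to 1000 keys per request, which is far fewer
        # round trips than issuing one delete per file.
        for i in range(0, len(keys_to_delete), 1000):
            batch = keys_to_delete[i:i + 1000]
            s3.delete_objects(
                Bucket=bucket,
                Delete={"Objects": [{"Key": key} for key in batch]},
            )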

    Here's more info on the vacuum command. As a side note, you may want to use coalesce instead of repartition when compacting, as described here.
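    For reference, the coalesce variant of the compaction would look something like this; coalesce avoids a full shuffle but can produce unevenly sized files, so it's worth measuring both:

        # Same compaction as above, but coalesce merges existing partitions
        # instead of triggering the full shuffle that repartition does.
        (spark.read.format("delta").load(s3path)
            .coalesce(num_files)
            .write.option("dataChange", "false").format("delta").mode("overwrite").save(s3path))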

    EDIT: Delta issue: https://github.com/delta-io/delta/issues/395 and PR: https://github.com/delta-io/delta/pull/416