Tags: amazon-web-services, scala, apache-spark, amazon-s3, apache-spark-sql

Spark: how to load large S3 path efficiently


I have a large S3 path partitioned by ID, year, month, and day. The path follows this format: s3a://bucket/results/id=xxxxx/year=xxxx/month=xx/day=xx. Below is the code that writes data in this layout.

import org.apache.spark.sql.SaveMode

// Append results, partitioned on disk by id, year, month, and day
results
      .write
      .mode(SaveMode.Append)
      .partitionBy("id", "year", "month", "day")
      .parquet("s3a://bucket/results")

Currently, I am trying to load partial data for a specific date (yyyy-mm-dd). Not every ID is guaranteed to contain data for that date; only a few of them do. However, because this S3 path is partitioned by ID as well as year, month, and day, there are 110,000 IDs, which makes my code load very slowly. Below is the code I use to load the data from S3.

// Retrieve data for all IDs corresponding to the specific date of August 23, 2024.
val dayData = spark.read
      .parquet("s3a://bucket/results/*/year=2024/month=08/day=23")
      .cache()

The process took about an hour; is there a way to make it faster and more efficient?


Solution

  • High-cardinality columns such as id are usually not used for on-disk partitioning. Consider a partitioning strategy that uses only the low-cardinality date columns (year, month, day), and rely on other formats or techniques for optimising query execution, such as a Delta table with Z-ordering or a bucketed table. A minimal sketch of the date-only layout is shown below.

    Please see the following resources for more information:
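
    As a rough illustration of the first suggestion, here is a minimal Scala sketch. It reuses results and spark from the question; the target path results_by_date, the integer-typed date columns, and the example id values are assumptions for illustration only, not the asker's actual schema.

    import org.apache.spark.sql.SaveMode

    // Partition only by the low-cardinality date columns and keep id as a
    // regular column inside each Parquet file (hypothetical target path).
    results
      .write
      .mode(SaveMode.Append)
      .partitionBy("year", "month", "day")
      .parquet("s3a://bucket/results_by_date")

    // Reading one day now lists a single partition directory, and the id
    // filter is pushed down to the Parquet files instead of driving the
    // directory listing.
    val oneDay = spark.read
      .parquet("s3a://bucket/results_by_date")
      .where("year = 2024 AND month = 8 AND day = 23") // assumes integer partition values
      .where("id IN (42, 43)")                         // hypothetical id filter; adjust to the real id type

    If the data is stored as a Delta table instead, the same date-only layout can additionally be Z-ordered on id (OPTIMIZE ... ZORDER BY (id)) so that an id filter also skips files within each day's partition; bucketing the table by id serves a similar purpose for plain Spark tables.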