Tags: azure, pyspark, query-optimization, databricks, delta-lake

Optimizing Delta Lake Partitioning for Timestamp Range Queries


I'm working with a Delta table in Databricks that is partitioned by a Year column, and the table also has a timestamp column. I want to optimize my queries to fetch data within a specific timestamp range (e.g., between 2019-01-21 and 2019-12-04):

df_filtered = df.filter((df.timestamp >= "2019-01-21") & (df.timestamp <= "2019-12-04"))

I understand that partitioning can help with performance, but I'm not sure how to leverage it effectively when querying based on a timestamp range. Even though I'm not directly querying the Year partition column, I'd like to take advantage of partition pruning to only read the relevant partitions.

How can I ensure that the Year partition column is correlated with the timestamp data effectively so that partition pruning works optimally for timestamp range queries?


Solution

  • The documentation for Delta table batch reads and writes mentions:

    Delta Lake may be able to generate partition filters for a query whenever a partition column is defined by one of the following expressions:

    • YEAR(col) and the type of col is TIMESTAMP.
    • (...)

    It means that if you have a table defined as:

    CREATE TABLE a_table (
        timestamp TIMESTAMP,
        year INT GENERATED ALWAYS AS (YEAR(timestamp)),
        [other columns]
    ) PARTITIONED BY (year);
    

    then Databricks will be able to analyze filter conditions on the timestamp column and derive the corresponding partition filter for year. With your example:

    df = spark.table("a_table")
    df_filtered = df.filter((df.timestamp >= "2019-01-21") & (df.timestamp <= "2019-12-04"))
    

    It will add a partition filter year = 2019, so only that partition is read.
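
    To confirm that the derived filter actually reaches the scan, you can inspect the physical plan. A minimal sketch, assuming a_table from the example above exists with the generated year column; the scan node should show a partition filter on year alongside the timestamp predicates:

    df = spark.table("a_table")
    df_filtered = df.filter((df.timestamp >= "2019-01-21") & (df.timestamp <= "2019-12-04"))
    # Print the extended plan; look for the year predicate in the
    # PartitionFilters of the file scan in addition to the timestamp filters.
    df_filtered.explain(True)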
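
    If you prefer to define the table from PySpark rather than SQL, the generated column can also be declared through the Delta Lake Python builder API. A minimal sketch, assuming the delta-spark package is available and using the same hypothetical a_table schema:

    from delta.tables import DeltaTable

    # Create the table with a generated year column and partition it by year,
    # mirroring the CREATE TABLE statement above.
    (DeltaTable.createIfNotExists(spark)
        .tableName("a_table")
        .addColumn("timestamp", "TIMESTAMP")
        .addColumn("year", "INT", generatedAlwaysAs="YEAR(timestamp)")
        .partitionedBy("year")
        .execute())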