Tags: azure, pyspark, query-optimization, databricks, delta-lake

Optimizing Delta Lake Partitioning for Timestamp Range Queries


I'm working with a Delta table in Databricks that is partitioned by a Year column, and the table also has a timestamp column. I want to optimize my queries to fetch data within a specific timestamp range (e.g., between 2019-01-21 and 2019-12-04):

df_filtered = df.filter((df.timestamp >= "2019-01-21") & (df.timestamp <= "2019-12-04"))

I understand that partitioning can help with performance, but I'm not sure how to leverage it effectively when querying based on a timestamp range. Even though I'm not directly querying the Year partition column, I'd like to take advantage of partition pruning to only read the relevant partitions.

How can I ensure that the Year partition column is correlated with the timestamp data effectively so that partition pruning works optimally for timestamp range queries?


Solution

  • The documentation for Delta table batch reads and writes mentions:

    Delta Lake may be able to generate partition filters for a query whenever a partition column is defined by one of the following expressions:

    • YEAR(col) and the type of col is TIMESTAMP.
    • (...)

    It means that if you have a table defined as:

    CREATE TABLE a_table (
        timestamp TIMESTAMP,
        year INT GENERATED ALWAYS AS (YEAR(timestamp)),
        [other columns]
    ) PARTITIONED BY (year);
    

    then Databricks will be able to analyze filter conditions on the timestamp column and derive the corresponding partition filter for year. With your example:

    df = spark.table("a_table")
    df_filtered = df.filter((df.timestamp >= "2019-01-21") & (df.timestamp <= "2019-12-04"))
    

    It will add a partition filter year = 2019, so only that partition is read.
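
    To confirm that the derived filter actually reaches the scan, you can inspect the physical plan. A minimal sketch, assuming a_table from the example above exists with the generated year column; the scan node should show a partition filter on year alongside the timestamp predicates:

    df = spark.table("a_table")
    df_filtered = df.filter((df.timestamp >= "2019-01-21") & (df.timestamp <= "2019-12-04"))
    # Print the extended plan; look for the year predicate in the
    # PartitionFilters of the file scan in addition to the timestamp filters.
    df_filtered.explain(True)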
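
    If you prefer to define the table from PySpark rather than SQL, the generated column can also be declared through the Delta Lake Python builder API. A minimal sketch, assuming the delta-spark package is available and using the same hypothetical a_table schema:

    from delta.tables import DeltaTable

    # Create the table with a generated year column and partition it by year,
    # mirroring the CREATE TABLE statement above.
    (DeltaTable.createIfNotExists(spark)
        .tableName("a_table")
        .addColumn("timestamp", "TIMESTAMP")
        .addColumn("year", "INT", generatedAlwaysAs="YEAR(timestamp)")
        .partitionedBy("year")
        .execute())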