I'm working with a Delta table in Databricks that is partitioned by a Year column, and I have a timestamp column in the table. I want to optimize my queries to fetch data within a specific timestamp range (e.g., between 2019-01-21 and 2019-12-04):
df_filtered = df.filter((df.timestamp >= "2019-01-21") & (df.timestamp <= "2019-12-04"))
I understand that partitioning can help with performance, but I'm not sure how to leverage it effectively when querying based on a timestamp range. Even though I'm not directly querying the Year partition column, I'd like to take advantage of partition pruning to only read the relevant partitions.

How can I ensure that the Year partition column is correlated with the timestamp data so that partition pruning works optimally for timestamp range queries?
The documentation for Delta table batch reads and writes mentions:
Delta Lake may be able to generate partition filters for a query whenever a partition column is defined by one of the following expressions:
- YEAR(col) and the type of col is TIMESTAMP.
- (...)
This means that if you have a table defined as:

CREATE TABLE a_table (
  timestamp TIMESTAMP,
  year INT GENERATED ALWAYS AS (YEAR(timestamp)),
  [other columns]
) PARTITIONED BY (year);
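
If you prefer to create the table from Python, the Delta Lake API exposes the same generated-column support through the DeltaTable builder. A minimal sketch, assuming an active SparkSession named spark and reusing the table name from the example above:

from delta.tables import DeltaTable

# Sketch: Delta table with a generated year column derived from timestamp,
# partitioned by year so filters on timestamp can be pruned.
(
    DeltaTable.createIfNotExists(spark)
    .tableName("a_table")
    .addColumn("timestamp", "TIMESTAMP")
    .addColumn("year", "INT", generatedAlwaysAs="YEAR(timestamp)")
    .partitionedBy("year")
    .execute()
)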
then Databricks will be able to analyze filter conditions on the timestamp column and derive proper partition filters for year. With your example:
df = spark.table("a_table")
df_filtered = df.filter((df.timestamp >= "2019-01-21") & (df.timestamp <= "2019-12-04"))
it will add a partition filter year = 2019.
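
To confirm that pruning actually happens, inspect the physical plan; the file scan node should report a partition filter on year instead of scanning the whole table. A quick check, using the df_filtered from above:

# Print the extended plan; look for a PartitionFilters entry that
# references the generated year column in the scan of a_table.
df_filtered.explain(True)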