I have a columnar table that is partitioned by day and hour. It is stored on S3 in parquet files to be queried by Athena. Here is the CREATE TABLE:
CREATE EXTERNAL TABLE foo (
-- other columns here
dt timestamp,
day string,
hour string
)
PARTITIONED BY (day string, hour string)
STORED AS parquet
LOCATION 's3://foo/foo'
And the layout on S3 is like:
s3://foo/foo/day=2021-10-10/hh=00/*.parquet
s3://foo/foo/day=2021-10-10/hh=01/*.parquet
...etc
s3://foo/foo/day=2021-10-10/hh=23/*.parquet
So a query like the following will be fast because it only scans over one hour of parquet files because the partition columns are being used to filter it:
-- fast, easy to write
SELECT * FROM foo WHERE day = '2021-10-10' AND hour = '00'
However, the table also includes the full datetime dt
. Usually we want to write queries for ranges that don't align to a day/hour boundary, and/or are in a different timezone.
For example, this will scan ALL parquet files and be very slow:
-- slow, easy to write
SELECT * FROM foo WHERE dt > '2021-10-09 23:05:00' AND dt < '2021-10-11 01:00:00'
It can be improved by manually calculating the day
and hour
that minimally enclose the time period:
-- fast, painful to write
SELECT * FROM foo
WHERE
((day, hh) IN (('2021-10-09', '23'), ('2021-10-11', '00')) OR day = '2021-10-10')
AND
dt > '2021-10-09 23:05:00' AND dt < '2021-10-11 01:00:00'
Ideally this extra condition could be added transparently by the database so as to avoid having to manually add the ((day,hh) IN (...))
.
Is this possible somehow with Athena?
I've wished for this feature many times, but unfortunately Athena does not support it. You have to include both the predicate for the dt
column and the day
and hour
partition keys.