I have a datalake on AWS, queried using Athena, with the following structure and sample data
Key | Date | Value
----+---------------+-------
a | 01/01/2020 | 4.5
a | 05/01/2020 | 6
a | 06/01/2020 | 3.2
b | 01/01/2020 | 2.4
b | 03/01/2020 | 5
I would like to run a query to extract values
for a specific date
and for each key
. If date is not a known one, like 99% of the time, value should be returned as a linear interpolation of the two closest one.
Dates
are here reported in dd/mm/YYYY format for simplicity, but in the datalake are stored as timestamps.
Example of result
If I want to get the values
for the 2nd of January (02/01/2020), the expected output is
Key | Date | Value
----+---------------+-------
a | 02/01/2020 | 4.875
b | 02/01/2020 | 3.70
Where 4.875 is the linear interpolation between 4.5 (value at 01/01/2020) and 6 (value at 05/01/2020). I have manually evaluated it as (y - 4.5) / (2 - 1) = (6 - 4.5) / (5 - 1)
(see linear interpolation for more reference).
Same for 3.7
How can I achieve that (if possible) with one single query?
Assumption: we always have a smaller and bigger date from the point we're searching for.
Update - Athena, based on PrestoDB, does not support JOIN LATERAL
, so that is not an option I can consider
This might be a good place for lateral joins:
select d.dt,
case
when n.date = p.date then p.value
else p.value + (n.value - p.value) / datediff('day', n.date, p.date)
end as new_value
from (select date '2020-04-01') d(date)
cross join lateral (
select t.* from mytable t where t.date <= d.date order by t.date desc limit 1
) p -- "previous" value
cross join lateral (
select t.* from mytable t where t.date >= d.date order by t.date limit 1
) n -- "next" value
We can write the query without lateral joins:
select date '2020-04-01' as dt, p.k,
case
when n.date = p.date then p.value
else p.value + (n.value - p.value) / datediff('day', n.date, p.date)
end as new_value
from (
select t.*,
row_number() over(partition by k order by date desc) as rn
from mytable t
where date <= '2020-04-01'
) p
inner join (
select t.*,
row_number() over(partition by k order by date) as rn
from mytable t
where date >= '2020-04-01'
) n on n.k = p.k
where p.rn = 1 and n.rn = 1
This also generalizes the query so it can process multiple keys at once (key
is language keyword, I used k
instead).