sql greatest-n-per-group presto amazon-athena lateral-join

SQL query with linear interpolation and Group By

I have a datalake on AWS, queried using Athena, with the following structure and sample data

Key |     Date      |  Value
----+---------------+-------
 a  |   01/01/2020  |  4.5
 a  |   05/01/2020  |  6
 a  |   06/01/2020  |  3.2
 b  |   01/01/2020  |  2.4
 b  |   03/01/2020  |  5

I would like to run a query to extract values for a specific date and for each key. If date is not a known one, like 99% of the time, value should be returned as a linear interpolation of the two closest one.

Dates are here reported in dd/mm/YYYY format for simplicity, but in the datalake are stored as timestamps.

Example of result

If I want to get the values for the 2nd of January (02/01/2020), the expected output is

Key |     Date      |  Value
----+---------------+-------
 a  |   02/01/2020  |  4.875
 b  |   02/01/2020  |  3.70

Where 4.875 is the linear interpolation between 4.5 (value at 01/01/2020) and 6 (value at 05/01/2020). I have manually evaluated it as (y - 4.5) / (2 - 1) = (6 - 4.5) / (5 - 1) (see linear interpolation for more reference).

Same for 3.7

How can I achieve that (if possible) with one single query?

Assumption: we always have a smaller and bigger date from the point we're searching for.

Update - Athena, based on PrestoDB, does not support JOIN LATERAL, so that is not an option I can consider

Solution

This might be a good place for lateral joins:

select d.dt, 
    case 
        when n.date = p.date then p.value
        else p.value + (n.value - p.value) / datediff('day', n.date, p.date)
    end as new_value
from (select date '2020-04-01') d(date)
cross join lateral (
    select t.* from mytable t where t.date <= d.date order by t.date desc limit 1
) p  -- "previous" value
cross join lateral (
    select t.* from mytable t where t.date >= d.date order by t.date limit 1
) n  -- "next" value

We can write the query without lateral joins:

select date '2020-04-01' as dt, p.k,
    case 
        when n.date = p.date then p.value
        else p.value + (n.value - p.value) / datediff('day', n.date, p.date)
    end as new_value
from (
    select t.*, 
        row_number() over(partition by k order by date desc) as rn
    from mytable t
    where date <= '2020-04-01'
) p
inner join (
    select t.*, 
        row_number() over(partition by k order by date) as rn
    from mytable t
    where date >= '2020-04-01'
) n on n.k = p.k
where p.rn = 1 and n.rn = 1

This also generalizes the query so it can process multiple keys at once (key is language keyword, I used k instead).