Search code examples
sqlgreatest-n-per-groupprestoamazon-athenalateral-join

SQL query with linear interpolation and Group By


I have a datalake on AWS, queried using Athena, with the following structure and sample data

Key |     Date      |  Value
----+---------------+-------
 a  |   01/01/2020  |  4.5
 a  |   05/01/2020  |  6
 a  |   06/01/2020  |  3.2
 b  |   01/01/2020  |  2.4
 b  |   03/01/2020  |  5

I would like to run a query to extract values for a specific date and for each key. If date is not a known one, like 99% of the time, value should be returned as a linear interpolation of the two closest one.

Dates are here reported in dd/mm/YYYY format for simplicity, but in the datalake are stored as timestamps.

Example of result

If I want to get the values for the 2nd of January (02/01/2020), the expected output is

Key |     Date      |  Value
----+---------------+-------
 a  |   02/01/2020  |  4.875
 b  |   02/01/2020  |  3.70

Where 4.875 is the linear interpolation between 4.5 (value at 01/01/2020) and 6 (value at 05/01/2020). I have manually evaluated it as (y - 4.5) / (2 - 1) = (6 - 4.5) / (5 - 1) (see linear interpolation for more reference).

Same for 3.7

How can I achieve that (if possible) with one single query?

Assumption: we always have a smaller and bigger date from the point we're searching for.

Update - Athena, based on PrestoDB, does not support JOIN LATERAL, so that is not an option I can consider


Solution

  • This might be a good place for lateral joins:

    select d.dt, 
        case 
            when n.date = p.date then p.value
            else p.value + (n.value - p.value) / datediff('day', n.date, p.date)
        end as new_value
    from (select date '2020-04-01') d(date)
    cross join lateral (
        select t.* from mytable t where t.date <= d.date order by t.date desc limit 1
    ) p  -- "previous" value
    cross join lateral (
        select t.* from mytable t where t.date >= d.date order by t.date limit 1
    ) n  -- "next" value
    

    We can write the query without lateral joins:

    select date '2020-04-01' as dt, p.k,
        case 
            when n.date = p.date then p.value
            else p.value + (n.value - p.value) / datediff('day', n.date, p.date)
        end as new_value
    from (
        select t.*, 
            row_number() over(partition by k order by date desc) as rn
        from mytable t
        where date <= '2020-04-01'
    ) p
    inner join (
        select t.*, 
            row_number() over(partition by k order by date) as rn
        from mytable t
        where date >= '2020-04-01'
    ) n on n.k = p.k
    where p.rn = 1 and n.rn = 1
    

    This also generalizes the query so it can process multiple keys at once (key is language keyword, I used k instead).