Search code examples
sqlselectsql-scriptsdata-partitioninghana-sql-script

How to create an average per partitions containing a maximum of 5 time dependent members?


My goal is to select an average of exactly 5 records only if they meet the left join criteria to another table. Let's say we have table one (left) with records:

RECNUM   ID    DATE         JOB
1      | cat | 2019.01.01 | meow
2      | dog | 2019.01.01 | bark

And we have table two (right) with records:

RECNUM   ID    Action_ID    DATE         REWARD
1      | cat | 1          | 2019.01.02 | 20
2      | cat | 99         | 2018.12.30 | 1
3      | cat | 23         | 2019.12.28 | 20       
4      | cat | 54         | 2018.01.01 | 20
5      | cat | 32         | 2018.01.02 | 20
6      | cat | 21         | 2018.01.03 | 20
7      | cat | 43         | 2018.12.28 | 1
8      | cat | 65         | 2018.12.29 | 1
9      | cat | 87         | 2018.09.12 | 1
10     | cat | 98         | 2018.10.11 | 1 
11     | dog | 56         | 2018.09.01 | 99 
12     | dog | 42         | 2019.09.02 | 99 

A result should return:

ID  | AVG(Reward_from_latest_5_jobs)
cat | 1

The criteria met should be: For each JOB from left table, try to find 5 latest but older unique Action_ID(s) for the same ID in the right table and calculate average for them. So in other words, dog has barked, we do not know what reward to give him and we try to count the average of the latest five rewards he got. If less than 5 found, do not return anything/put null, if more, discard the oldest ones.

The way I wanted to do it is like:

         SELECT a."ID", COUNT(b."Action_ID"), AVG(b."REWARD")  
         FROM 
             ( 
                SELECT "ID", "DATE"
                 FROM :left_table
             ) a  

              LEFT JOIN

             ( 
                SELECT "ID", "Action_ID", "DATE", "REWARD"
                 FROM :right_table
             ) b 

             ON(
                    a."ID" = b."ID" 
               )    
         WHERE a."DATE" > b."DATE" 
         GROUP BY a."ID"
         HAVING COUNT(b."Action_ID") >= 5;

But then it would calculate for all the Action_ID(s) that match the criteria and not only the five latest ones. Could you please tell how to achieve expected results? I can use sub-tables and it does not have to be done in one SQL statement. Procedures are not allowed for this use case. Any input highly appreciated.


Solution

  • Use window functions to get the top 5:

    select id, avg(reward)
    from (select r.*,
                 row_number() over (partition by l.id order by r.date desc) as seqnum
          from table1 l join
               table2 r
               on l.id = r.id and l.date > r.date
         ) r
    where seqnum <= 5
    group by id
    having count(*) >= 5;
    

    Then a having clause to filter out those ids that don't have five rows.