"distinct on" with group by postgres


I have the following records:

id  run_hour               performance_hour       value
2   "2017-06-25 09:00:00"  "2017-06-25 07:00:00"  6
2   "2017-06-25 09:00:00"  "2017-06-25 08:00:00"  5
1   "2017-06-25 09:00:00"  "2017-06-25 08:00:00"  5
2   "2017-06-25 08:00:00"  "2017-06-25 07:00:00"  5
1   "2017-06-25 08:00:00"  "2017-06-25 07:00:00"  5
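
The records above can be reproduced with a setup along these lines. The table name metrics is taken from the query further down; the column types are an assumption, since the question does not state them.

```sql
-- Assumed schema: integer ids and values, timestamps without time zone.
create table metrics (
  id               integer   not null,
  run_hour         timestamp not null,
  performance_hour timestamp not null,
  value            integer   not null
);

-- The five sample records from the question.
insert into metrics (id, run_hour, performance_hour, value) values
  (2, '2017-06-25 09:00:00', '2017-06-25 07:00:00', 6),
  (2, '2017-06-25 09:00:00', '2017-06-25 08:00:00', 5),
  (1, '2017-06-25 09:00:00', '2017-06-25 08:00:00', 5),
  (2, '2017-06-25 08:00:00', '2017-06-25 07:00:00', 5),
  (1, '2017-06-25 08:00:00', '2017-06-25 07:00:00', 5);
```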

A job runs every hour and looks at the result of each id for the current hour and for previous hours.

We insert a new record only if the value changed since the previous hourly run. We don't want to overwrite the value, because we want to measure how it looks after 1 hour, after 2 hours, etc.

For each id, I would like to sum the values, taking for every performance hour the latest available value (sorted by run_hour).

In the above example, id 1 has no record for the 9:00 run and the 7:00 performance hour, because the value was the same as in the 8:00 run for the 7:00 performance hour.

In the above example, if I ask for the sum of values as of run 2017-06-25 09:00:00, I would expect to get:

id, value
1   10
2   11

For id 1 it is 10, calculated as (run_hour<2017-06-25 08:00:00> + run_hour<2017-06-25 09:00:00>), and for id 2 it is 11, calculated as (run_hour<2017-06-25 09:00:00> + run_hour<2017-06-25 09:00:00>).

I wrote the following query:

select distinct on (id, run_hour) id, sum(value)
from metrics
where run_hour <= '2017-06-25 09:00'
  and performance_hour >= '2017-06-25 07:00'
  and performance_hour < '2017-06-25 09:00'
group by id
order by id, run_hour

However, I get an error saying that run_hour also has to be in the GROUP BY clause. If I add it, I get incorrect data: I also get rows for previous hours, which I don't need. I only need the latest hour that had data.

How can I use "distinct on" with group by?


Solution

  • The task is quite complicated, because a wanted performance hour may have no record of its own. Let's say you want the performance hours 7:00 through 9:00 from the following data:

    id  run_hour               performance_hour      value
    2   "2017-06-25 09:00:00"  "2017-06-25 06:00:00"    6
    2   "2017-06-25 09:00:00"  "2017-06-25 10:00:00"    5
    

    The expected result would be 18 (6 for 7:00 + 6 for 8:00 + 6 for 9:00), all based on the 6:00 record, which itself is outside the desired time range.

    We need a recursive CTE that starts at the first wanted performance hour per id and steps hour by hour to the last wanted one. This way we build the records that don't exist, so that we can sum them up later.

    with recursive cte(id, run_hour, performance_hour, value) as
    (
      -- Anchor: per id, the latest record at or before the first wanted hour,
      -- moved up to 07:00 if it is older.
      select *
      from
      (
        select distinct on (id) 
          id, 
          run_hour,
          greatest(performance_hour, timestamp '2017-06-25 07:00') as performance_hour, 
          value
        from metrics
        where run_hour = timestamp '2017-06-25 09:00' 
          and performance_hour <= timestamp '2017-06-25 07:00'
        order by id, metrics.performance_hour desc
      ) start_by_id
      union all
      -- Step: advance one hour; use the stored value if a record exists,
      -- otherwise carry the previous value forward.
      select 
        cte.id, 
        cte.run_hour,
        cte.performance_hour + interval '1 hour' as performance_hour,
        coalesce(m.value, cte.value) as value
      from cte
      left join metrics m on m.id = cte.id
                          and m.run_hour = cte.run_hour
                          and m.performance_hour = cte.performance_hour + interval '1 hour'
      -- Stop once the last wanted hour (09:00) has been generated.
      where cte.performance_hour < timestamp '2017-06-25 09:00'
    )
    select id, sum(value)
    from cte
    group by id;
    

    Rextester link: http://rextester.com/PHC88770
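
For the literal question of combining "distinct on" with group by, a simpler pattern may be enough when every wanted performance hour has at least one record, as in the question's sample data: run DISTINCT ON in a subquery to pick the latest run per (id, performance_hour), then aggregate in the outer query. The sketch below is not the recursive approach above and does not fill in performance hours that have no record at all. The inline metrics CTE only stands in for the real table so that the statement runs on its own, and the alias latest is a made-up name.

```sql
-- Stand-in for the real table, holding the question's five sample records.
with metrics(id, run_hour, performance_hour, value) as (
  values
    (2, timestamp '2017-06-25 09:00', timestamp '2017-06-25 07:00', 6),
    (2, timestamp '2017-06-25 09:00', timestamp '2017-06-25 08:00', 5),
    (1, timestamp '2017-06-25 09:00', timestamp '2017-06-25 08:00', 5),
    (2, timestamp '2017-06-25 08:00', timestamp '2017-06-25 07:00', 5),
    (1, timestamp '2017-06-25 08:00', timestamp '2017-06-25 07:00', 5)
)
select id, sum(value) as value
from (
  -- Keep one row per (id, performance_hour): the one from the latest run.
  select distinct on (id, performance_hour)
    id,
    performance_hour,
    value
  from metrics
  where run_hour <= timestamp '2017-06-25 09:00'
    and performance_hour >= timestamp '2017-06-25 07:00'
    and performance_hour < timestamp '2017-06-25 09:00'
  order by id, performance_hour, run_hour desc
) latest
group by id
order by id;
```

On the sample records this returns 10 for id 1 and 11 for id 2, matching the result expected in the question. The error in the original query arises because DISTINCT ON and ORDER BY are evaluated after GROUP BY, so run_hour is no longer available as a plain column at that point; moving DISTINCT ON into the subquery avoids this.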