Search code examples
sqloracleselectsql-order-bywindow-functions

SQL Query to get last 24 contiguous hours with potentially missing hours


So I need to return the most recent 24 contiguous hours from a query. Table holds hourly data. Getting the last 24 hours is not a problem but I sometimes have missing data and therefore need to go further back in time to find the the first "full" set of 24 hours.

select date, value from TABLE
where date >= (select max(date)-1 from TABLE) 

However sometimes I have missing hours with this query. How can I ensure I always get 24 rows back and that it is the most recent block of 24 hours?

An example below:

enter image description here

Notice for category A, hour 1/31/2020 hour 23 is missing and therefore what should be returned are hours 1/31/2020 22 thru 1/30/2020 23. Category B should return hours 2/1/2020 hour 0 thru 1/31/2020 hour 1.


Solution

  • You need a few steps. First, for each record, you need to see how many hours of contiguous preceding data it has. That is what the grouped_hour_data clause does in the solution below.

    Then, you need to select from that result, getting only the rows that have a full 24 hours of contiguous preceding data. Then fetch only the first 24 rows of that.

    This solution is simplified to take advantage of the fact that all your dates were truncated to the hour and there were no duplicates. If your problem is more complicated than that, this solution can still support it but it will need to be revised.

    In this example, we create test data going back several days, but remove data from individual hours on the 16th and 17th, so that the 1st contiguous 24 hour period ends early on the 16th.

    alter session set nls_date_format = 'DD-MON-YYYY HH24:MI:SS';
    
    with hour_data_raw AS (
    SELECT to_date('17-JUN-2020 17:00:00','DD-MON-YYYY HH24:MI:SS') - ( INTERVAL '1' HOUR ) * rownum dte
    FROM dual
    CONNECT BY rownum <= 200 ),
    hour_data AS ( SELECT dte 
                   FROM hour_data_raw
                   WHERE NOT TRUNC(dte,'HH') = to_date('17-JUN-2020 02:00:00','DD-MON-YYYY HH24:MI:SS')
                   AND NOT TRUNC(dte,'HH') = to_date('16-JUN-2020 02:00:00','DD-MON-YYYY HH24:MI:SS') ),
    -- SOLUTION BEGINS HERE... everything above is just test data
    -- WITH...
       grouped_hour_data AS (
         SELECT h.*, count(trunc(h.dte,'HH')) OVER ( ORDER BY dte desc RANGE BETWEEN CURRENT ROW AND INTERVAL '1' DAY - INTERVAL '1' SECOND FOLLOWING ) cnt
         FROM hour_data h
         ORDER BY dte)
    SELECT * FROM grouped_hour_data
    WHERE cnt = 24
    ORDER BY dte desc
    FETCH FIRST 24 ROWS ONLY;
    
    +----------------------+-----+
    |         DTE          | CNT |
    +----------------------+-----+
    | 16-JUN-2020 01:00:00 |  24 |
    | 16-JUN-2020 00:00:00 |  24 |
    | 15-JUN-2020 23:00:00 |  24 |
    | 15-JUN-2020 22:00:00 |  24 |
    | 15-JUN-2020 21:00:00 |  24 |
    | 15-JUN-2020 20:00:00 |  24 |
    | 15-JUN-2020 19:00:00 |  24 |
    | 15-JUN-2020 18:00:00 |  24 |
    | 15-JUN-2020 17:00:00 |  24 |
    | 15-JUN-2020 16:00:00 |  24 |
    | 15-JUN-2020 15:00:00 |  24 |
    | 15-JUN-2020 14:00:00 |  24 |
    | 15-JUN-2020 13:00:00 |  24 |
    | 15-JUN-2020 12:00:00 |  24 |
    | 15-JUN-2020 11:00:00 |  24 |
    | 15-JUN-2020 10:00:00 |  24 |
    | 15-JUN-2020 09:00:00 |  24 |
    | 15-JUN-2020 08:00:00 |  24 |
    | 15-JUN-2020 07:00:00 |  24 |
    | 15-JUN-2020 06:00:00 |  24 |
    | 15-JUN-2020 05:00:00 |  24 |
    | 15-JUN-2020 04:00:00 |  24 |
    | 15-JUN-2020 03:00:00 |  24 |
    | 15-JUN-2020 02:00:00 |  24 |
    +----------------------+-----+
    

    EDIT: handling category field

    To handle the additional category field you added, you need to do a few things.

    First, PARTITION BY category when you are computing the cnt field. This will cause each category's data to be treated separately when computing this value. So, for example, a value in hour 2 for category A will not count as a value in hour 2 for category B.

    Second, you can no longer use FETCH FIRST 24 ROWS ONLY to get the data you want, because you need the first 24 rows in each category now. So, you need an extra step (ordered_groups, in the revised query below) to order the rows in each category that have 24 continuous hours of data preceding them. Call that ordering rn and then, in the final query, just select where rn <= 24.

    WITH grouped_hour_data AS (
         SELECT h.*, count(trunc(h.dte,'HH')) OVER ( 
                  PARTITION BY category 
                  ORDER BY dte desc 
                  RANGE BETWEEN CURRENT ROW 
                      AND INTERVAL '1' DAY - INTERVAL '1' SECOND FOLLOWING ) cnt
         FROM hour_data h
         ORDER BY dte),
     ordered_groups AS (
       SELECT ghd.*, row_number() over ( partition by ghd.category order by ghd.dte desc ) rn
       FROM   grouped_hour_data
       WHERE  ghd.cnt = 24 )
    SELECT * FROM ordered_groups
    WHERE rn <= 24;
    ORDER BY category, dte desc;
    

    Disclosure: I have not tested this updated logic so there may be some errors.