Search code examples
sqlpostgresqlgreatest-n-per-groupwindow-functions

Find greatest and least column values in a sliding window frame


Coming from procedural programming, I am thinking of a loop to solve this. Example logic:

  • Sort by column1 value and date column descending. Query similar to:
select * 
from table1 
where column1 in ('cpu1') 
order by date_time desc;
  • Find the last row from the select query as well as the row 22 lines above it to define the loop start point for the first set of data to review. i do not have the rows serialized currently but know the time will be 5 minutes between each row entry so I could use the timestamp to determine 22 rows. I think the query looks something like:
select * 
from table1 
where date_time between x and y;
  • find the 10th row in the data selected in the between statement and determine if column2 in the 10th row is greater than all 12 rows above and all 9 rows below based or column 3 10th row is less than 12 rows above and 9 rows below. If it is greater or less than print 10th row to a new table.

I don't have anything for this query yet because I'm not understanding how to navigate table indexing in the fashion you might with Javascript arrays or Python dictionaries.

  • increment x, y and 10th row values -1 and run loop again.

The problem

I'm struggling with understanding how to put a PostgreSQL script together. I've designed similar style loops with Python or Javascript but the more I read about possible ways to go about it in SQL the more confused I'm getting. Is the most optimal way to do this CTE, for, while or a cursor?

Example table data:

column1 column2 column3 date
cpu1 3.2 1.1 01:45:00
cpu1 3.5 1.3 01:40:00
cpu1 5.9 0.1 01:35:00
cpu1 1.7 1.2 01:30:00
cpu1 2.5 2.3 01:25:00
cpu1 5.2 4.1 01:20:00
cpu3 4.2 1.1 01:45:00
cpu3 3.1 1.0 01:40:00
cpu3 5.0 3.3 01:35:00
cpu3 4.7 3.2 01:30:00
cpu3 2.7 2.1 01:25:00
cpu3 6.2 4.8 01:20:00
  • column1 will be device names that write time series to the database
  • column2 and column3 will be recorded values from the device
  • date will always be in 5 minute increments
  • The table1 has millions of rows

Solution

  • This can be implemented elegantly and efficiently with window functions:

    For all devices:

    SELECT *
    FROM  (
       SELECT *, col2 > max(col2) OVER (PARTITION BY device ORDER BY ts ROWS BETWEEN 12 PRECEDING AND 9 FOLLOWING EXCLUDE CURRENT ROW)
              OR col3 < min(col3) OVER (PARTITION BY device ORDER BY ts ROWS BETWEEN 12 PRECEDING AND 9 FOLLOWING EXCLUDE CURRENT ROW) AS bingo
       FROM   tbl
       ) sub
    WHERE  bingo;
    

    For a single given device:

    SELECT *
    FROM  (
       SELECT *
            , col2 > max(col2) OVER (ORDER BY ts ROWS BETWEEN 12 PRECEDING AND 9 FOLLOWING EXCLUDE CURRENT ROW) OR
              col3 < min(col3) OVER (ORDER BY ts ROWS BETWEEN 12 PRECEDING AND 9 FOLLOWING EXCLUDE CURRENT ROW) AS bingo
       FROM   tbl
       WHERE  device = 'cpu3'
       ) sub
    WHERE  bingo;
    

    fiddle

    Assuming all columns are defined NOT NULL, and the timestamp (ts in my query) is also UNIQUE per device. Or you have to define how to deal with null values and duplicates in the sort order.

    This uses a whole arsenal of advanced window function features. The "frame exclusion" option EXCLUDE CURRENT ROW requires Postgres 11 or later. Read the manual here. Related:

    Not an easy task if you are accustomed to procedural languages.