python, sql, pandas, sqlalchemy, rolling-computation

How to calculate the maximum occurrence in a rolling window?


Say I have a table as follows:

--------------------------------------------------
| Type       | Incident ID     | Date of incident|
--------------------------------------------------
| A          | 1               | 2022-02-12      |
| A          | 2               | 2022-02-14      |
| A          | 3               | 2022-02-14      |
| A          | 4               | 2022-02-14      |
| A          | 5               | 2022-02-16      |
| A          | 6               | 2022-02-17      |
| A          | 7               | 2022-02-19      |
| A          | 8               | 2022-02-19      |
| A          | 9               | 2022-02-19      |
| A          | 10              | 2022-02-19      |
| ...        | ...             | ...             |
| B          | 1               | 2022-02-12      |
| B          | 2               | 2022-02-12      |
| B          | 3               | 2022-02-13      |
| ...        | ...             | ...             |
--------------------------------------------------

This is a list of incidents of different types. Every incident has a type, an ID and a date at which it occurred. This is just an example to help understand my goal.

What I want is, for a given range (e.g. 5 days), the maximum value that a rolling sum over these incidents reaches:

So I would start with all elements that fall into the first 5 days and accumulate the occurrences: 6.

2022-02-12 - 2022-02-17:    6

Rolling the window forward by one day drops all elements of the first day from the sum (here, one element: -1) and adds any elements of the newly included day (here, none). The next value would be 5.

2022-02-13 - 2022-02-18:    5

6 > 5, so 6 is still the maximum occurrence of incidents in a 5-day window.

Continue for the complete time range.

This is not hard to achieve, but how would I do it efficiently for millions of elements? In short: I want to move a window of a fixed date range (e.g. 5 days) over the data, count all occurrences for each window and return the maximum value that any window reaches.

By the way, I am using SQLAlchemy, but I would also be interested in plain SQL.

An appropriate test set would be this:

test_data_small = {'Id': [1, 2, 3, 4, 5,
                          6, 7, 8, 9, 10,
                          0, 1, 2, 3],
                   'Type': ['A', 'A', 'A', 'A',
                            'A', 'A', 'A', 'A',
                            'A', 'A', 'B', 'B',
                            'B', 'B'],
                   'Date': ['2022-02-12', '2022-02-14',
                            '2022-02-14', '2022-02-14',
                            '2022-02-16', '2022-02-17',
                            '2022-02-19', '2022-02-19',
                            '2022-02-19', '2022-02-19',
                            '2022-02-16', '2022-02-12',
                            '2022-02-12', '2022-02-13']
                   }

I am connecting to a table via sqlalchemy like so:

from sqlalchemy import select

# incidents is the Table object; keep the statement in its own variable
# so the table is not shadowed, and execute the select directly
stmt = select(
    incidents.c.type,
    incidents.c.id,
    incidents.c.date,
)

result = self.connection.execute(stmt).fetchall()

Is it even possible in plain SQL? Or should I rather use pandas to apply a rolling window?


Solution

  • As already mentioned in the comments, this problem is best solved in SQL, and plain SQL is enough. You only need a single join: join the table with itself on the type and on the date range spanned by the window size, then count the matching rows for each incident.

    Using the test data you provided yields the following result.

    select
        t.id, t.type, t.date, count(*) as num_inc_next_5_days
    from tbl as t
        inner join tbl as i
            on  t.date >= dateadd(day, -5, i.date)
            and t.date <= i.date
            and t.type = i.type
    group by t.id, t.type, t.date
    order by t.type, t.id, t.date
    
    |----|------|------------|---------------------|
    | id | type | date       | num_inc_next_5_days |
    |----|------|------------|---------------------|
    | 1  | A    | 2022-02-12 | 6                   |
    | 2  | A    | 2022-02-14 | 9                   |
    | 3  | A    | 2022-02-14 | 9                   |
    | 4  | A    | 2022-02-14 | 9                   |
    | 5  | A    | 2022-02-16 | 6                   |
    | 6  | A    | 2022-02-17 | 5                   |
    | 7  | A    | 2022-02-19 | 4                   |
    | 8  | A    | 2022-02-19 | 4                   |
    | 9  | A    | 2022-02-19 | 4                   |
    | 10 | A    | 2022-02-19 | 4                   |
    | 0  | B    | 2022-02-16 | 1                   |
    | 1  | B    | 2022-02-12 | 4                   |
    | 2  | B    | 2022-02-12 | 4                   |
    | 3  | B    | 2022-02-13 | 2                   |
    

    The last column shows the number of incidents within the next five days, including the current date. If you want the maximum of these counts, just wrap the query inside a CTE and take the maximum:

    with cte as (
        ...  -- the query from above (drop the order by)
    )
    select max(num_inc_next_5_days) as max_incidents
    from cte
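
    Since you mentioned SQLAlchemy: the same self-join can also be expressed in SQLAlchemy Core. Below is a minimal sketch under a few assumptions: tbl is the SQLAlchemy Table for the table used here, connection is an open connection, and func.dateadd(...) simply mirrors the dateadd() call from the query above, so adjust the date arithmetic to your dialect if needed.

    from sqlalchemy import and_, func, select, text

    t = tbl.alias("t")   # tbl: assumed Table object for the table used here
    i = tbl.alias("i")

    # one row per incident with the count of incidents of the same type
    # within the following five days (current date included)
    per_incident = (
        select(
            t.c.id,
            t.c.type,
            t.c.date,
            func.count().label("num_inc_next_5_days"),
        )
        .select_from(
            t.join(
                i,
                and_(
                    t.c.type == i.c.type,
                    t.c.date <= i.c.date,
                    # mirrors dateadd(day, -5, i.date); dialect-specific
                    t.c.date >= func.dateadd(text("day"), -5, i.c.date),
                ),
            )
        )
        .group_by(t.c.id, t.c.type, t.c.date)
        .cte("per_incident")
    )

    # overall maximum over all windows
    max_incidents = connection.execute(
        select(func.max(per_incident.c.num_inc_next_5_days))
    ).scalar_one()

    For millions of rows, an index on (type, date) is likely to help this join. If you want the maximum per type rather than the overall maximum, group the outer select by type (in plain SQL: select type, max(num_inc_next_5_days) from cte group by type).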
    

    Used data:

    create table tbl (
        id int,
        type varchar(1),
        date date
    )
    
    insert into tbl values
    (1, 'A', '2022-02-12'),
    (2, 'A', '2022-02-14'),
    (3, 'A', '2022-02-14'),
    (4, 'A', '2022-02-14'),
    (5, 'A', '2022-02-16'),
    (6, 'A', '2022-02-17'),
    (7, 'A', '2022-02-19'),
    (8, 'A', '2022-02-19'),
    (9, 'A', '2022-02-19'),
    (10, 'A', '2022-02-19'),
    (0, 'B', '2022-02-16'),
    (1, 'B', '2022-02-12'),
    (2, 'B', '2022-02-12'),
    (3, 'B', '2022-02-13');
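
    You also asked about pandas: a time-based rolling count per type gives the same maxima. Here is a minimal sketch using the test_data_small dict from the question, with a 6-calendar-day window to mirror the inclusive 2022-02-12 - 2022-02-17 example above.

    import pandas as pd

    # test data from the question
    df = pd.DataFrame({
        'Id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 1, 2, 3],
        'Type': ['A'] * 10 + ['B'] * 4,
        'Date': pd.to_datetime([
            '2022-02-12', '2022-02-14', '2022-02-14', '2022-02-14',
            '2022-02-16', '2022-02-17', '2022-02-19', '2022-02-19',
            '2022-02-19', '2022-02-19', '2022-02-16', '2022-02-12',
            '2022-02-12', '2022-02-13',
        ]),
    })

    # trailing 6-calendar-day window per type; the maximum of the trailing
    # counts equals the maximum of the forward-looking counts in the SQL,
    # because the densest window can always be shifted to end on an incident
    rolling_counts = (
        df.sort_values('Date')
          .set_index('Date')
          .groupby('Type')['Id']
          .rolling('6D')
          .count()
    )

    print(rolling_counts.groupby(level='Type').max())  # A: 9.0, B: 4.0
    print(rolling_counts.max())                        # overall maximum: 9.0

    That said, pulling millions of rows into pandas just to take a maximum is usually slower than letting the database do it, so I would stick with the SQL above.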