Say I have a table as follows:
--------------------------------------------------
| Type | Incident ID | Date of incident |
--------------------------------------------------
| A    | 1           | 2022-02-12       |
| A    | 2           | 2022-02-14       |
| A    | 3           | 2022-02-14       |
| A    | 4           | 2022-02-14       |
| A    | 5           | 2022-02-16       |
| A    | 6           | 2022-02-17       |
| A    | 7           | 2022-02-19       |
| A    | 8           | 2022-02-19       |
| A    | 9           | 2022-02-19       |
| A    | 10          | 2022-02-19       |
| ...  | ...         | ...              |
| B    | 1           | 2022-02-12       |
| B    | 2           | 2022-02-12       |
| B    | 3           | 2022-02-13       |
| ...  | ...         | ...              |
--------------------------------------------------
This is a list of incidents of different types. Every incident has a type, an ID, and a date on which it occurred. This is just an example to help explain my goal.
What I want is, for a given range (e.g. 5 days), the maximum value that a rolling sum over these incidents reaches:
I would start with all elements that fall into the first 5 days and accumulate the occurrences: 6.
2022-02-12 - 2022-02-17: 6
When the window rolls forward by one day, all incidents of the first day drop out of the sum (here -1) and any incidents on the newly added day get added (here none). The next value would be 5.
2022-02-13 - 2022-02-18: 5
6 > 5, so 6 is still the maximum number of incidents in a 5-day window.
Continue for the complete time range.
This is not that hard to achieve, but how would I do it efficiently for millions of elements? In short: I want to create a moving window of a fixed date range (e.g. 5 days), count all occurrences within this window and return the maximum value reached by any window.
By the way, I am using SQLAlchemy, but I would also be interested in plain SQL.
An appropriate test set would be this:
test_data_small = {
    'Id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 1, 2, 3],
    'Type': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
             'B', 'B', 'B', 'B'],
    'Date': ['2022-02-12', '2022-02-14', '2022-02-14', '2022-02-14',
             '2022-02-16', '2022-02-17', '2022-02-19', '2022-02-19',
             '2022-02-19', '2022-02-19', '2022-02-16', '2022-02-12',
             '2022-02-12', '2022-02-13'],
}
I am connecting to the table via SQLAlchemy like so:
incidents_query = select(
    incidents.c.type,
    incidents.c.id,
    incidents.c.date,
)
result = self.connection.execute(incidents_query).fetchall()
Is this even possible in plain SQL? Or should I use pandas and apply a rolling window?
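To make the intent concrete, here is a rough brute-force sketch in pandas of what I want to compute, built from test_data_small above (window semantics as in my example, i.e. the start date plus the following 5 days). I doubt something like this is efficient enough for millions of rows:
import pandas as pd

df = pd.DataFrame(test_data_small)
df['Date'] = pd.to_datetime(df['Date'])

window = pd.Timedelta(days=5)
max_per_type = {}

for incident_type, group in df.groupby('Type'):
    # A window with the maximum count can always be shifted so that it starts
    # on an incident date, so it is enough to check windows anchored there.
    counts = [
        ((group['Date'] >= start) & (group['Date'] <= start + window)).sum()
        for start in group['Date'].unique()
    ]
    max_per_type[incident_type] = int(max(counts))

print(max_per_type)  # with test_data_small: {'A': 9, 'B': 4}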
As already mentioned in the comments, this problem should be solved in SQL; in particular, it can be done in plain SQL. You only need a single join: join your table with itself on the type and on the date range spanned by the window size, then count the matching rows. With the test data you provided, the query below yields the following result.
select
    t.id, t.type, t.date, count(*) as num_inc_next_5_days
from tbl as t
inner join tbl as i
    -- match every incident i of the same type whose date falls into
    -- the window [t.date, t.date + 5 days]
    on  t.date >= dateadd(day, -5, i.date)
    and t.date <= i.date
    and t.type = i.type
group by t.id, t.type, t.date
order by t.type, t.id, t.date
|----|------|------------|---------------------|
| id | type | date       | num_inc_next_5_days |
|----|------|------------|---------------------|
| 1  | A    | 2022-02-12 | 6                   |
| 2  | A    | 2022-02-14 | 9                   |
| 3  | A    | 2022-02-14 | 9                   |
| 4  | A    | 2022-02-14 | 9                   |
| 5  | A    | 2022-02-16 | 6                   |
| 6  | A    | 2022-02-17 | 5                   |
| 7  | A    | 2022-02-19 | 4                   |
| 8  | A    | 2022-02-19 | 4                   |
| 9  | A    | 2022-02-19 | 4                   |
| 10 | A    | 2022-02-19 | 4                   |
| 0  | B    | 2022-02-16 | 1                   |
| 1  | B    | 2022-02-12 | 4                   |
| 2  | B    | 2022-02-12 | 4                   |
| 3  | B    | 2022-02-13 | 2                   |
The last column shows the number of incidents within the next five days, including the current date. If you want the maximum of these counts, just wrap the query inside a CTE and take the maximum:
with cte as (
...
)
select max(...)
from cte
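Since you mentioned SQLAlchemy: the same idea can be expressed with SQLAlchemy Core. The following is only an untested sketch; it assumes an incidents Table object with columns type, id and date, an open connection, and a backend that understands dateadd (e.g. SQL Server). Adjust the date arithmetic for other backends.
from sqlalchemy import select, func, text

t = incidents.alias('t')
i = incidents.alias('i')

# One row per incident t with the number of incidents of the same type
# whose date falls into the window [t.date, t.date + 5 days].
per_incident = (
    select(
        t.c.type,
        t.c.id,
        t.c.date,
        func.count().label('num_inc_next_5_days'),
    )
    .join(
        i,
        (t.c.type == i.c.type)
        & (t.c.date >= func.dateadd(text('day'), -5, i.c.date))
        & (t.c.date <= i.c.date),
    )
    .group_by(t.c.type, t.c.id, t.c.date)
    .cte('per_incident')
)

# Take the maximum over all windows; `connection` stands for your open
# SQLAlchemy connection (self.connection in your snippet).
max_stmt = select(func.max(per_incident.c.num_inc_next_5_days))
max_incidents = connection.execute(max_stmt).scalar()
If you need the maximum per type rather than one overall maximum, add a group_by on the type column to the outer select and select per_incident.c.type alongside the max.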
Used data:
create table tbl (
id int,
type varchar(1),
date date
);
insert into tbl values
(1, 'A', '2022-02-12'),
(2, 'A', '2022-02-14'),
(3, 'A', '2022-02-14'),
(4, 'A', '2022-02-14'),
(5, 'A', '2022-02-16'),
(6, 'A', '2022-02-17'),
(7, 'A', '2022-02-19'),
(8, 'A', '2022-02-19'),
(9, 'A', '2022-02-19'),
(10, 'A', '2022-02-19'),
(0, 'B', '2022-02-16'),
(1, 'B', '2022-02-12'),
(2, 'B', '2022-02-12'),
(3, 'B', '2022-02-13');