I've got a dataset with group id, dates and locations. I'd like to count the number of unique locations per row per group for the last n1 and the next n2 days. In the example below, n1=2 and n2=3 and I'd like the solution based on dynamic n1 and n2. it's a rather large dataset so performance is key. n1 and n2 include the row's date as 1 day.
data = {
'GROUP': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
'DATE': ['1/01/2023', '1/01/2023', '1/01/2023', '2/01/2023', '2/01/2023', '3/01/2023', '3/01/2023', '3/01/2023', '3/01/2023', '1/06/2023', '2/06/2023', '3/06/2023', '4/06/2023', '5/06/2023', '5/06/2023', '5/06/2023', '6/06/2023'],
'LOCATION': ['A1', 'A1', 'A2', 'A1', 'A2', 'A2', 'A3', 'A3', 'A4', 'B1', 'B2', 'B2', 'B3', 'B4', 'B2', 'B1', 'B1']
}
df = pd.DataFrame(data)
desired output: # n1=2 # n2=3
GROUP DATE LOCATION LOCS_LAST_n1_DAYS LOCS_NEXT_n2_DAYS
0 A 1/01/2023 A1 1 4
1 A 1/01/2023 A1 1 4
2 A 1/01/2023 A2 2 4
3 A 2/01/2023 A1 2 4
4 A 2/01/2023 A2 2 3
5 A 3/01/2023 A2 2 3
6 A 3/01/2023 A3 3 2
7 A 3/01/2023 A3 3 2
8 A 3/01/2023 A4 4 1
9 B 1/06/2023 B1 1 2
10 B 2/06/2023 B2 2 2
11 B 3/06/2023 B2 1 3
12 B 4/06/2023 B3 2 4
13 B 5/06/2023 B4 2 3
14 B 5/06/2023 B2 3 2
15 B 5/06/2023 B1 4 1
16 B 6/06/2023 B1 3 1
You can use a groupby.rolling
operations. Because those only works on numeric data, first factorize
the column:
days = {'n1': 2, 'n2': 3}
df['DATE'] = pd.to_datetime(df['DATE'], dayfirst=True)
df = df.sort_values(by=['GROUP', 'DATE'])
out = df.join(pd.concat([
(pd.Series(pd.factorize(df['LOCATION'])[0],
index=df['DATE'])
.groupby(df['GROUP'].values)
.rolling(f'{v}D')
.apply(lambda x: x.nunique()).to_frame(name=f'LOCS_LAST_{k}_DAYS')
) for k, v in days.items()], axis=1)
.set_axis(df.index)
)
Output:
GROUP DATE LOCATION LOCS_LAST_n1_DAYS LOCS_LAST_n2_DAYS
0 A 2023-01-01 A1 1.0 1.0
1 A 2023-01-01 A1 1.0 1.0
2 A 2023-01-01 A2 2.0 2.0
3 A 2023-01-02 A1 2.0 2.0
4 A 2023-01-02 A2 2.0 2.0
5 A 2023-01-03 A2 2.0 2.0
6 A 2023-01-03 A3 3.0 3.0
7 A 2023-01-03 A3 3.0 3.0
8 A 2023-01-03 A4 4.0 4.0
9 B 2023-06-01 B1 1.0 1.0
10 B 2023-06-02 B2 2.0 2.0
11 B 2023-06-03 B2 1.0 2.0
12 B 2023-06-04 B3 2.0 2.0
13 B 2023-06-05 B4 2.0 3.0
14 B 2023-06-05 B2 3.0 3.0
15 B 2023-06-05 B1 4.0 4.0
16 B 2023-06-06 B1 3.0 4.0