
Get the number of unique locations in the last/next dynamically chosen n days, for each row in a group


I've got a dataset with a group id, dates and locations. For each row, I'd like to count the number of unique locations within its group over the last n1 days and the next n2 days. In the example below, n1=2 and n2=3, but I'd like a solution that works for arbitrary n1 and n2. It's a rather large dataset, so performance is key. Both windows include the row's own date as one day.

import pandas as pd

data = {
    'GROUP': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
    'DATE': ['1/01/2023', '1/01/2023', '1/01/2023', '2/01/2023', '2/01/2023', '3/01/2023', '3/01/2023', '3/01/2023', '3/01/2023', '1/06/2023', '2/06/2023', '3/06/2023', '4/06/2023', '5/06/2023', '5/06/2023', '5/06/2023', '6/06/2023'],
    'LOCATION': ['A1', 'A1', 'A2', 'A1', 'A2', 'A2', 'A3', 'A3', 'A4', 'B1', 'B2', 'B2', 'B3', 'B4', 'B2', 'B1', 'B1'],
}
df = pd.DataFrame(data)
Desired output (n1=2, n2=3):

       GROUP       DATE       LOCATION  LOCS_LAST_n1_DAYS  LOCS_NEXT_n2_DAYS
0      A        1/01/2023       A1                 1                 4
1      A        1/01/2023       A1                 1                 4
2      A        1/01/2023       A2                 2                 4
3      A        2/01/2023       A1                 2                 4
4      A        2/01/2023       A2                 2                 3
5      A        3/01/2023       A2                 2                 3
6      A        3/01/2023       A3                 3                 2
7      A        3/01/2023       A3                 3                 2
8      A        3/01/2023       A4                 4                 1
9      B        1/06/2023       B1                 1                 2
10     B        2/06/2023       B2                 2                 2
11     B        3/06/2023       B2                 1                 4
12     B        4/06/2023       B3                 2                 4
13     B        5/06/2023       B4                 2                 3
14     B        5/06/2023       B2                 3                 2
15     B        5/06/2023       B1                 4                 1
16     B        6/06/2023       B1                 3                 1
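
For reference, the counting rule can be sketched brute-force. My reading of the table: both windows include the row's own date, and rows sharing a date are consumed in row order. This is only a baseline to pin down the semantics, not a performant solution:

```python
import pandas as pd

data = {
    'GROUP': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
    'DATE': ['1/01/2023', '1/01/2023', '1/01/2023', '2/01/2023', '2/01/2023', '3/01/2023', '3/01/2023', '3/01/2023', '3/01/2023', '1/06/2023', '2/06/2023', '3/06/2023', '4/06/2023', '5/06/2023', '5/06/2023', '5/06/2023', '6/06/2023'],
    'LOCATION': ['A1', 'A1', 'A2', 'A1', 'A2', 'A2', 'A3', 'A3', 'A4', 'B1', 'B2', 'B2', 'B3', 'B4', 'B2', 'B1', 'B1'],
}
df = pd.DataFrame(data)
df['DATE'] = pd.to_datetime(df['DATE'], dayfirst=True)
df = df.sort_values(['GROUP', 'DATE'])

def unique_locs(g, n1, n2):
    """Per-row unique-location counts over the last n1 / next n2 days."""
    dates = g['DATE'].tolist()
    locs = g['LOCATION'].tolist()
    last, nxt = [], []
    for i, d in enumerate(dates):
        start = d - pd.Timedelta(days=n1 - 1)   # window includes the row's date
        end = d + pd.Timedelta(days=n2 - 1)
        # rows at or before i whose date falls in the backward window
        last.append(len({locs[j] for j in range(i + 1) if dates[j] >= start}))
        # rows at or after i whose date falls in the forward window
        nxt.append(len({locs[j] for j in range(i, len(locs)) if dates[j] <= end}))
    return last, nxt

parts = []
for _, g in df.groupby('GROUP'):
    last, nxt = unique_locs(g, n1=2, n2=3)
    parts.append(pd.DataFrame(
        {'LOCS_LAST_n1_DAYS': last, 'LOCS_NEXT_n2_DAYS': nxt}, index=g.index))
res = df.join(pd.concat(parts))
```

This is O(rows²) per group, so it is only useful as a correctness check against a faster solution.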

Solution

  • You can use a groupby.rolling operation. Because rolling only works on numeric data, first factorize the LOCATION column. Note that time-based rolling windows are backward-looking, so this computes the LAST counts for both window sizes:

    days = {'n1': 2, 'n2': 3}
    
    df['DATE'] = pd.to_datetime(df['DATE'], dayfirst=True)
    df = df.sort_values(by=['GROUP', 'DATE'])
    
    out = df.join(
        pd.concat(
            [(pd.Series(pd.factorize(df['LOCATION'])[0], index=df['DATE'])
                .groupby(df['GROUP'].values)       # one rolling window per group
                .rolling(f'{v}D')                  # time-based window of v days
                .apply(lambda x: x.nunique())      # unique location codes in window
                .to_frame(name=f'LOCS_LAST_{k}_DAYS'))
             for k, v in days.items()],
            axis=1,
        ).set_axis(df.index)                       # restore df's row labels
    )
    

    Output:

       GROUP       DATE LOCATION  LOCS_LAST_n1_DAYS  LOCS_LAST_n2_DAYS
    0      A 2023-01-01       A1                1.0                1.0
    1      A 2023-01-01       A1                1.0                1.0
    2      A 2023-01-01       A2                2.0                2.0
    3      A 2023-01-02       A1                2.0                2.0
    4      A 2023-01-02       A2                2.0                2.0
    5      A 2023-01-03       A2                2.0                2.0
    6      A 2023-01-03       A3                3.0                3.0
    7      A 2023-01-03       A3                3.0                3.0
    8      A 2023-01-03       A4                4.0                4.0
    9      B 2023-06-01       B1                1.0                1.0
    10     B 2023-06-02       B2                2.0                2.0
    11     B 2023-06-03       B2                1.0                2.0
    12     B 2023-06-04       B3                2.0                2.0
    13     B 2023-06-05       B4                2.0                3.0
    14     B 2023-06-05       B2                3.0                3.0
    15     B 2023-06-05       B1                4.0                4.0
    16     B 2023-06-06       B1                3.0                4.0
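
    Since the rolling window is backward-looking, the above only yields the LAST counts. One way to get the NEXT counts with the same trick (a sketch, not from the original answer: the `_ord` tiebreaker and the `mirror` index are my own additions) is to reverse each group and mirror the dates so the index stays monotonically increasing:

```python
import pandas as pd

data = {
    'GROUP': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
    'DATE': ['1/01/2023', '1/01/2023', '1/01/2023', '2/01/2023', '2/01/2023', '3/01/2023', '3/01/2023', '3/01/2023', '3/01/2023', '1/06/2023', '2/06/2023', '3/06/2023', '4/06/2023', '5/06/2023', '5/06/2023', '5/06/2023', '6/06/2023'],
    'LOCATION': ['A1', 'A1', 'A2', 'A1', 'A2', 'A2', 'A3', 'A3', 'A4', 'B1', 'B2', 'B2', 'B3', 'B4', 'B2', 'B1', 'B1'],
}
df = pd.DataFrame(data)
df['DATE'] = pd.to_datetime(df['DATE'], dayfirst=True)
df = df.sort_values(by=['GROUP', 'DATE'])

days_next = {'n2': 3}   # forward-looking window sizes

# Reverse each group (dates descending, date ties in reverse row order) so that
# a backward-looking rolling window walks forward through the real data.
rev = (df.assign(_ord=range(len(df)))
         .sort_values(['GROUP', 'DATE', '_ord'], ascending=[True, False, False])
         .drop(columns='_ord'))

# Mirror the dates so the reversed order is monotonically increasing in time.
ref = df['DATE'].max()
mirror = pd.DatetimeIndex(ref + (ref - rev['DATE']))

out = df.join(pd.concat(
    [(pd.Series(pd.factorize(rev['LOCATION'])[0], index=mirror)
        .groupby(rev['GROUP'].values)
        .rolling(f'{v}D')                      # backward over mirrored time
        .apply(lambda x: x.nunique())          # = forward over real time
        .to_frame(name=f'LOCS_NEXT_{k}_DAYS'))
     for k, v in days_next.items()],
    axis=1,
).set_axis(rev.index))
```

    The window `(t-3D, t]` over the mirrored index corresponds to the row's date plus the following n2-1 days in real time, so the semantics match the LAST computation.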