Search code examples
pandasdataframenumpyrolling-computationaccumulate

Rolling count unique in dataframe's rows or ndarray


Given a dataframe df, how to calculate the rolling count of unique values through rows' direction subject to a boundary condition: window size = n?

Input data:

import pandas as pd
import numpy as np

data = {'col_0':[7, 8, 9, 10, 11, 12],
        'col_1':[4, 5, 6, 7, 8, 9],
        'col_2':[2, 5, 8, 11, 14, 15],
        'col_3':[2, 6, 10, 14, 18, 21],
        'col_4':[7, 5, 7, 5, 7, 5],
        'col_5':[2, 6, 10, 14, 18, 21]}
df = pd.DataFrame(data)
print(df)
###
   col_0  col_1  col_2  col_3  col_4  col_5
0      7      4      2      2      7      2
1      8      5      5      6      5      6
2      9      6      8     10      7     10
3     10      7     11     14      5     14
4     11      8     14     18      7     18
5     12      9     15     21      5     21

Expected output (with window size = 2):

print(df)
###
   col_0  col_1  col_2  col_3  col_4  col_5  rolling_nunique
0      7      4      2      2      7      2                3
1      8      5      5      6      5      6                6
2      9      6      8     10      7     10                6
3     10      7     11     14      5     14                8
4     11      8     14     18      7     18                7
5     12      9     15     21      5     21               10

For the example above with window size = 2.

  • At window 0's array we have row[0].

    [[7 4 2 2 7 2]]
    

    rolling_nunique[0] is 3 with the elements being [2, 4, 7].

  • At window 1's array we have row[0] & row[1].

    [[7 4 2 2 7 2]
    [8 5 5 6 5 6]]
    

    rolling_nunique[1] is 6 with the elements being [2, 4, 5, 6, 7, 8].

  • At window 2's array we have row[1] & row[2].

    [[ 8  5  5  6  5  6]
    [ 9  6  8 10  7 10]]
    

    rolling_nunique[2] is 6 with the elements being [5, 6, 7, 8, 9, 10].

  • etc.


Solution

  • Using sliding_window_view, you can customize how the values are aggregated in the sliding window. To get values for all rows before the sliding window is full (i.e., emulate min_periods=1 in pd.rolling), we need to add some empty rows at the top. This can be done using vstack and full. At the end, we need to account for these added nan values by filtering them away.

    from numpy.lib.stride_tricks import sliding_window_view
    
    w = 2
    values = np.vstack([np.full([w-1, df.shape[1]], np.nan), df.values])
    m = sliding_window_view(values, w, axis=0).reshape(len(df), -1)
    unique_count = [len(np.unique(r[~np.isnan(r)])) for r in m]
    df['rolling_nunique'] = unique_count
    

    Result:

    
       col_0  col_1  col_2  col_3  col_4  col_5  rolling_nunique
    0      7      4      2      2      7      2                3
    1      8      5      5      6      5      6                6
    2      9      6      8     10      7     10                6
    3     10      7     11     14      5     14                8
    4     11      8     14     18      7     18                7
    5     12      9     15     21      5     21               10