Search code examples
pandasdataframegroup-byrolling-computation

Pandas rolling function on categorical variables


I have a pandas dataframe like this

group cat
1     0
2     0
1     0
1     1
2     0
2     1 
1     2
1     2

I'm trying to group the data by group, then applies a custom function to the past 5 rows. The custom function looks like this

def unalikeability(data):

    num_observations = data.shape[0]
    counts = data.value_counts()

    return 1 - ((counts / num_observations)**2).sum()

Desired output:

group unalikeability
1     result calculated by the function
1
1
1
2
2
2
2

I can get the past 5 rows using groupby().rolling(), but the rolling object in pandas doesn't have the shape/ value_counts attribute and method like a DataFrame. I tried creating a DataFrame from the rolling object, but this isn't allowed either.


Solution

  • You can apply your function. Depending on whether you want the output to be computed only on full chunks (5 values), or chunks of any size, use min_periods:

    def unalikeability(data):
    
        num_observations = data.shape[0]
        counts = data.value_counts()
    
        return 1 - ((counts / num_observations)**2).sum()
    
    # compute the score only if we have 5 rows
    df['out1'] = (df.groupby('group')
                    .rolling(5)['cat']
                    .apply(unalikeability)
                    .droplevel('group')
                  )
    
    # compute the score with incomplete chunks
    df['out2'] = (df.groupby('group')
                    .rolling(5, min_periods=1)['cat']
                    .apply(unalikeability)
                    .droplevel('group')
                  )
    

    Output:

       group  cat  out1      out2
    0      1    0   NaN  0.000000
    1      2    0   NaN  0.000000
    2      1    0   NaN  0.000000
    3      1    1   NaN  0.444444
    4      2    0   NaN  0.000000
    5      2    1   NaN  0.444444
    6      1    2   NaN  0.625000
    7      1    2  0.64  0.640000