Tags: python, pandas, pandas-groupby, rolling-computation, percentile

Grouped percentile rank of value in rolling time window


In the sample data below, users place orders of random values at random dates. I've implemented a method to calculate the percentile rank of each order's value relative to that same user's orders over the preceding 180 days.

However, for large values of n the last groupby line runs very slowly (about 1m30s for 1M rows). Does anyone have a suggestion on how to improve the computing time?

import pandas as pd
import numpy as np
from scipy.stats import percentileofscore

# percentile rank of the window's last value within the window
def rank(x, kind):
    return percentileofscore(x, score=x.iloc[-1], kind=kind)

# sample data: random order values at random dates per user
n = 10000
orders = pd.DataFrame({
    'user': np.random.randint(1, 100, size=n),
    'value': np.random.randn(n),
    'date': np.random.choice(pd.date_range('1/1/2019', periods=730, freq='D'), n)
})

orders_sort = orders.sort_values(by=['user', 'date']).reset_index(drop=True)

# per-user percentile rank over a 180-day rolling time window - SLOW!
orders_sort.groupby('user')[['value', 'date']].rolling('180d', on='date').apply(lambda x: rank(x, kind='mean'))

               value       date
user
1    0     50.000000 2019-01-03
     1     75.000000 2019-01-10
     2     83.333333 2019-01-12
     3     87.500000 2019-01-17
     4     10.000000 2019-01-22
...              ...        ...
99   9995  19.565217 2020-11-23
     9996  64.583333 2020-11-26
     9997  39.583333 2020-12-04
     9998  54.000000 2020-12-05
     9999   6.000000 2020-12-12

[10000 rows x 2 columns]

Solution

  • You can leverage the parameter raw=True in apply to pass a NumPy array instead of a Series; skipping the construction of a new Series for every window removes most of the per-window overhead. You need to change your function slightly to work with an array.

    def rank_np(x, kind):
        return percentileofscore(x, score=x[-1], kind=kind)  # plain indexing: x is an ndarray, so no .iloc
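
    As a quick sanity check of what this computes: with kind='mean', SciPy averages the strict (<) and weak (<=) percentages, so ranking the last value 2.5 against the window [1, 2, 3, 2.5] should give (50.0 + 75.0) / 2:

    rank_np(np.array([1.0, 2.0, 3.0, 2.5]), kind='mean')  # 62.5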
    

    Then call it as you did, but with the raw parameter:

    orders_sort.groupby('user')[['value', 'date']]\
      .rolling('180d', on='date')\
      .apply(lambda x: rank_np(x, kind='mean'), raw=True)  # raw=True passes each window as an ndarray
    

    I get a speedup of about 6.5x with n = 10K or 50K; I'm not sure how it behaves for n = 1M rows. A minimal timing sketch for checking on your own data follows below.
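
    This sketch times both variants end to end. It reuses the orders_sort, rank, and rank_np definitions above; timed is a hypothetical helper, and absolute numbers will vary with hardware and n:

    import time

    def timed(fn):
        # run fn once and return elapsed wall-clock seconds
        t0 = time.perf_counter()
        fn()
        return time.perf_counter() - t0

    t_series = timed(lambda: orders_sort.groupby('user')[['value', 'date']]
                     .rolling('180d', on='date')
                     .apply(lambda x: rank(x, kind='mean')))
    t_raw = timed(lambda: orders_sort.groupby('user')[['value', 'date']]
                  .rolling('180d', on='date')
                  .apply(lambda x: rank_np(x, kind='mean'), raw=True))
    print(f"Series apply: {t_series:.1f}s, raw apply: {t_raw:.1f}s, "
          f"speedup ~{t_series / t_raw:.1f}x")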