When I create a function and use rolling() with apply() to calculate a rolling 3-day percentile, the result shows 0s after the first 3 days for the rest of the column.
I assume that the first 2 days, which have NaN values, are not being used in the percentile calculation, and that this may be defaulting the rest of the column to zero and incorrectly giving the value 33 for the third day. But I'm not sure about this.
I have been trying to solve this but have not found a solution. Does anybody know why this happens and how to correct the code below? Any help would be greatly appreciated.
import pandas as pd
import numpy as np
from scipy import stats

data = {'a': [1, 15, 27, 399, 17, 568, 200, 9],
        'b': [2, 30, 15, 60, 15, 80, 53, 41],
        'c': [100, 200, 3, 78, 25, 88, 300, 91],
        'd': [4, 300, 400, 500, 23, 43, 9, 71]}
dfgrass = pd.DataFrame(data)

def percnum(x):
    for t in dfgrass.index:
        aaa = (x <= dfgrass.loc[t, 'b']).value_counts()
        ccc = (x <= dfgrass.loc[t, 'b']).values.sum()
        vvv = len(x)
        nnn = ccc / vvv
        return nnn * 100

dfgrass['e'] = dfgrass['b'].rolling(window=3).apply(percnum)
print(dfgrass)
Another option for what you are attempting is to apply pandas' rank method with pct=True inside your function. This computes the percentile rank directly on the subset defined by the rolling window. This can be done like so:
def rolling_percentile(x):
    d = pd.DataFrame(x)
    d['rolling'] = d.rank(pct=True)
    return d.iloc[-1, 1]
Then you can insert that into your apply:
df['rolling_apply'] = df[column].rolling(window).apply(rolling_percentile)
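As a concrete sketch, here is the same function applied to column b from the question with a window of 3. The one-column frame df and the explicit raw=False are my assumptions for illustration, not part of the original code:

```python
import pandas as pd

def rolling_percentile(x):
    # x holds the values of one rolling window
    d = pd.DataFrame(x)
    # percentile rank of every value within this window
    d['rolling'] = d.rank(pct=True)
    # keep only the rank of the last (most recent) value
    return d.iloc[-1, 1]

# Column b from the question; df is a stand-in name for illustration
df = pd.DataFrame({'b': [2, 30, 15, 60, 15, 80, 53, 41]})

# raw=False passes each window to the function as a Series
df['rolling_apply'] = df['b'].rolling(3).apply(rolling_percentile, raw=False)
print(df)
```

The first two rows stay NaN because the window is not yet full; after that the values should come out as roughly 0.667, 1.0, 0.5, 1.0, 0.667, 0.333. Multiply by 100 if you want them on a 0 to 100 scale. Note that tied values share an averaged rank by default, which is why the window [15, 60, 15] gives 0.5.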
Additional notes on the function: There are other ways to do this, but within the function I create a rolling column on the subset x of the initial dataframe. Each call receives in x a window of n values ending at the current row. For example, if your window is three, the values passed (a NumPy array with raw=True, otherwise a Series) will look a little like this: [1, 15, 27].
Hence, the rolling percentile that interests us is that of the last value of x relative to the values contained within the window. Therefore we take the value at position [-1, 1], which corresponds to the rolling column of the last value.
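To make the "last value relative to its window" idea concrete, here is a minimal sketch that skips the intermediate DataFrame and compares each window against its own last element. It also shows what happens when every window is compared against a fixed reference instead. The function names and the reference value 2 (the first entry of b) are my assumptions; the 33-then-zeros output described in the question is consistent with that kind of fixed comparison, but I have not confirmed that this is the cause:

```python
import pandas as pd

b = pd.Series([2, 30, 15, 60, 15, 80, 53, 41], name='b')

def pct_at_or_below_last(window):
    # window is a NumPy array because raw=True is used below;
    # share of window values <= the most recent value, in percent
    return (window <= window[-1]).mean() * 100

def pct_at_or_below_fixed(window, reference=2):
    # same calculation, but against a reference that never moves
    return (window <= reference).mean() * 100

moving = b.rolling(3).apply(pct_at_or_below_last, raw=True)
fixed = b.rolling(3).apply(pct_at_or_below_fixed, raw=True)

print(pd.DataFrame({'b': b,
                    'moving_ref': moving.round(1),
                    'fixed_ref': fixed.round(1)}))
```

With the moving reference the column should read NaN, NaN, 66.7, 100.0, 66.7, 100.0, 66.7, 33.3. With the fixed reference it reads NaN, NaN, 33.3 and then 0.0 for every later row, since no later window contains a value at or below 2. The moving version counts ties as "at or below", so it can differ from rank(pct=True), which averages tied ranks by default.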