Search code examples
pandasrolling-computation

Pandas Rolling Filter


I have a df that looks like this for example:

df = pd.DataFrame({'Change': [-0.5, -0.25, 1, 1.5, 0.25,-0.75, -1, 2, .... })

What I'm looking to do is to filter this df['Change'] column for above and below 0, calculate the mean for both, and then add the two results together. But I want to be able to do this on a rolling basis over a window of 100.

I can do this in a very blunt way by just looping through the df as I do in the below:

df['diff'] = np.NAN
for i in range(0, len(df)):
    if i > 100:
        X = df.iloc[i - 100:i]
        ups = X[X['change'] > 0]['change'].mean()
        downs = X[X['change'] < 0]['change'].mean()
        df['diff'].iloc[i] = (ups + downs)

However, it's quite slow and was wondering if there was a more efficient way of doing this?


Solution

  • You can use df.rolling(window=100).apply(foo) where foo is your function. for example:

    import pandas as pd
    import numpy as np
    from time import monotonic
    
    df = pd.DataFrame({'Change': np.random.randint(-5, 5, size=(10000,))})
    
    def foo(x):
        ups = x[x > 0].mean()
        downs = x[x < 0].mean()
        return ups + downs
    
    
    start = monotonic()
    res = df.Change.rolling(100).apply(foo)
    print(f'time took: {monotonic() - start:.3f} s.')
    
    Out:
        time took: 2.544 s.
    

    With raw=True keyword argument it will be faster

    start = monotonic()
    res = df.Change.rolling(100).apply(foo, raw=True)
    print(f'with raw=True time took: {monotonic() - start:.3f} s.')
    
    
    Out:
         with raw=True time took: 0.107 s.
    

    Also yoy can parallelize apply method with parallel-pandas library. It's very simple:

    import pandas as pd
    import numpy as np
    from time import monotonic
    from parallel_pandas import ParallelPandas
    
    ParallelPandas.initialize(n_cpu=8, disable_pr_bar=True)
    
    df = pd.DataFrame({'Change': np.random.randint(-5, 5, size=(10000,))})
    
    
    def foo(x):
        ups = x[x > 0].mean()
        downs = x[x < 0].mean()
        return ups + downs
    #p_apply - is parallel analogue of apply method
    start = monotonic()
    res = df.Change.rolling(100).p_apply(foo, raw=True)
    print(f'parallel with raw=True time took: {monotonic() - start:.3f} s.')
    
    Out:
        parallel with raw=True time took: 0.047 s.
    

    Total speedup is 2.5 / 0.05 = 50!