I have a df
that looks like this for example:
df = pd.DataFrame({'Change': [-0.5, -0.25, 1, 1.5, 0.25,-0.75, -1, 2, .... })
What I'm looking to do is to filter this df['Change']
column for above and below 0, calculate the mean for both, and then add the two results together. But I want to be able to do this on a rolling basis over a window of 100.
I can do this in a very blunt way by just looping through the df
as I do in the below:
df['diff'] = np.NAN
for i in range(0, len(df)):
if i > 100:
X = df.iloc[i - 100:i]
ups = X[X['change'] > 0]['change'].mean()
downs = X[X['change'] < 0]['change'].mean()
df['diff'].iloc[i] = (ups + downs)
However, it's quite slow and was wondering if there was a more efficient way of doing this?
You can use df.rolling(window=100).apply(foo)
where foo
is your function.
for example:
import pandas as pd
import numpy as np
from time import monotonic
df = pd.DataFrame({'Change': np.random.randint(-5, 5, size=(10000,))})
def foo(x):
ups = x[x > 0].mean()
downs = x[x < 0].mean()
return ups + downs
start = monotonic()
res = df.Change.rolling(100).apply(foo)
print(f'time took: {monotonic() - start:.3f} s.')
Out:
time took: 2.544 s.
With raw=True
keyword argument it will be faster
start = monotonic()
res = df.Change.rolling(100).apply(foo, raw=True)
print(f'with raw=True time took: {monotonic() - start:.3f} s.')
Out:
with raw=True time took: 0.107 s.
Also yoy can parallelize apply
method with parallel-pandas library.
It's very simple:
import pandas as pd
import numpy as np
from time import monotonic
from parallel_pandas import ParallelPandas
ParallelPandas.initialize(n_cpu=8, disable_pr_bar=True)
df = pd.DataFrame({'Change': np.random.randint(-5, 5, size=(10000,))})
def foo(x):
ups = x[x > 0].mean()
downs = x[x < 0].mean()
return ups + downs
#p_apply - is parallel analogue of apply method
start = monotonic()
res = df.Change.rolling(100).p_apply(foo, raw=True)
print(f'parallel with raw=True time took: {monotonic() - start:.3f} s.')
Out:
parallel with raw=True time took: 0.047 s.
Total speedup is 2.5 / 0.05 = 50
!