I have a DataFrame df with one time series variable, call it X.
I can find the rolling mean over the last (say) n observations with:
df['rolling_mean'] = df.X.rolling(window=n).mean()
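For anyone who wants to reproduce this, here is a minimal sketch of that step on toy data. The values of X and n are made up for illustration. Note that the first n - 1 rows have no full window, so rolling_mean is NaN there, and any comparison against it evaluates to False.

```python
import pandas as pd

# Toy data; the values and n = 3 are illustrative assumptions.
df = pd.DataFrame({"X": [1.0, 2.0, 3.0, 4.0, 5.0]})
n = 3

# Each value is the mean of the current row and the n - 1 rows before it.
df["rolling_mean"] = df.X.rolling(window=n).mean()

# The first n - 1 rows are NaN because their window is incomplete.
print(df)
```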
Now I want a new column with a boolean value that applies the following logic to each row and returns True if either:
- X > rolling_mean and, sometime within the next k observations, X attains a value less than rolling_mean
- X < rolling_mean and, sometime within the next k observations, X attains a value greater than rolling_mean

My current workaround looks something like:
new_column = []
for i, row in df.iterrows():
    next_k = df[i:i + k]
    if row.X < row.rolling_mean and any(next_k.X > row.rolling_mean):
        new_column.append(1)
    elif row.X > row.rolling_mean and any(next_k.X < row.rolling_mean):
        new_column.append(1)
    else:
        new_column.append(0)
df['new_column'] = new_column
But obviously this is iterative and not fast enough for a large dataset.
Is there a fast/vectorized way of doing this?
A nice solution I came up with:
df['rolling_max'] = df.X.rolling(k, min_periods=0).max().shift(-k)
df['rolling_min'] = df.X.rolling(k, min_periods=0).min().shift(-k)
df['will_drop_below_mean'] = (df.X > df.rolling_mean) & (df.rolling_min < df.rolling_mean)
df['will_rise_above_mean'] = (df.X < df.rolling_mean) & (df.rolling_max > df.rolling_mean)
df['new_column'] = (df.will_drop_below_mean | df.will_rise_above_mean).astype(int)
Basically, this uses rolling().max() and rolling().min() with .shift(-k) to look ahead over the next k observations, then combines the results into two boolean masks.
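To make the approach easy to test, here is a self-contained sketch that wraps the same steps in a function and runs them on toy data. The function name crosses_mean_within_k, the sample values, and the choices n = 3 and k = 2 are my own assumptions for illustration, not part of the original code.

```python
import pandas as pd


def crosses_mean_within_k(df, n, k):
    """Flag rows where X sits on one side of its rolling mean and reaches
    the other side of that same mean value within the next k rows."""
    out = df.copy()
    out["rolling_mean"] = out.X.rolling(window=n).mean()

    # After shift(-k), row i holds the max/min of X over rows i+1 .. i+k.
    future_max = out.X.rolling(k, min_periods=0).max().shift(-k)
    future_min = out.X.rolling(k, min_periods=0).min().shift(-k)

    will_drop = (out.X > out.rolling_mean) & (future_min < out.rolling_mean)
    will_rise = (out.X < out.rolling_mean) & (future_max > out.rolling_mean)

    # Comparisons against NaN are False, so rows without a full rolling
    # mean or without k future rows end up as 0.
    out["new_column"] = (will_drop | will_rise).astype(int)
    return out


# Toy series (hypothetical values).
toy = pd.DataFrame({"X": [1, 2, 3, 2, 1, 2, 3, 4, 3, 2]})
result = crosses_mean_within_k(toy, n=3, k=2)
print(result)
```

Two things to keep in mind: shift(-k) leaves NaN in the last k rows, so those rows are never flagged, and the look-ahead window here covers rows i+1 through i+k, whereas the loop in the question slices df[i:i + k], which includes the current row and stops one row earlier.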