Tags: python, pandas, dataframe, vectorization

Pandas Dataframe check relation of variable and rolling mean


I have a DataFrame df with one time series variable, call it X.

I can find the rolling mean over the last (say) n observations with

df['rolling_mean'] = df.X.rolling(window=n).mean()
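
For illustration, here is a minimal sketch on toy data (the values and n = 3 are assumptions, not part of the real dataset). It shows that the first n - 1 rows have no complete window, so their rolling mean is NaN and any comparison against it is False:

```python
import pandas as pd

# Toy data; the values and n = 3 are assumptions for illustration only.
df = pd.DataFrame({'X': [1.0, 2.0, 3.0, 4.0, 5.0]})
n = 3

df['rolling_mean'] = df.X.rolling(window=n).mean()

# The first n - 1 rows have no complete window, so their mean is NaN.
print(df)
```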

Now I want a new column with a boolean value that applies the following logic to each row and returns true if either:

  • X > rolling_mean and sometime within the next k observations, X attains a value less than rolling_mean
  • or X < rolling_mean and sometime within the next k observations, X attains a value greater than rolling_mean.

My current workaround looks something like:

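new_column = []
for pos, (_, row) in enumerate(df.iterrows()):
    # The k observations that follow the current row (positional slice,
    # so it works whatever the index labels are)
    next_k = df.iloc[pos + 1:pos + 1 + k]
    if row.X < row.rolling_mean and (next_k.X > row.rolling_mean).any():
        new_column.append(1)
    elif row.X > row.rolling_mean and (next_k.X < row.rolling_mean).any():
        new_column.append(1)
    else:
        new_column.append(0)

df['new_column'] = new_column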

But obviously this is iterative and not fast enough for a large dataset.

Is there a fast/vectorized way of doing this?


Solution

  • A nice solution I came up with:

    # Max/min of X over the k observations that follow each row
    df['rolling_max'] = df.X.rolling(k, min_periods=0).max().shift(-k)
    df['rolling_min'] = df.X.rolling(k, min_periods=0).min().shift(-k)
    
    # Currently above the mean, and a later value falls below it (and vice versa)
    df['will_drop_below_mean'] = (df.X > df.rolling_mean) & (df.rolling_min < df.rolling_mean)
    df['will_rise_above_mean'] = (df.X < df.rolling_mean) & (df.rolling_max > df.rolling_mean)
    
    df['new_column'] = (df.will_drop_below_mean | df.will_rise_above_mean).astype(int)
    

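    Basically, rolling(k).max() and rolling(k).min() summarize a trailing window of k rows, and .shift(-k) moves that summary back k rows so that each row sees the max/min of the k observations that follow it. Comparing those against rolling_mean gives the two boolean masks, which are then combined with |. Note that the shift leaves NaN in the last k rows, so those rows always come out as 0.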
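
As a self-contained check, the sketch below runs a variant of the same idea on toy data and compares it with a plain loop. The random data, n = 5 and k = 3 are assumptions for illustration. It also swaps in a slightly different technique for the forward window: rolling over the reversed series and shifting by one instead of using .shift(-k), so the shorter windows near the end of the series are still evaluated rather than becoming NaN. The helper names future_max, future_min, will_drop and will_rise are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy inputs; the data, n and k are assumptions chosen for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({'X': rng.normal(size=200).cumsum()})
n, k = 5, 3

df['rolling_mean'] = df.X.rolling(window=n).mean()

# Forward-looking max/min over rows i+1 .. i+k: roll over the reversed
# series, flip back, then shift by one so the current row is excluded.
# min_periods=1 keeps the shorter windows near the end of the series.
rev = df.X[::-1]
future_max = rev.rolling(k, min_periods=1).max()[::-1].shift(-1)
future_min = rev.rolling(k, min_periods=1).min()[::-1].shift(-1)

will_drop = (df.X > df.rolling_mean) & (future_min < df.rolling_mean)
will_rise = (df.X < df.rolling_mean) & (future_max > df.rolling_mean)
df['new_column'] = (will_drop | will_rise).astype(int)

# Reference loop with the same "next k rows" definition, for comparison.
expected = []
for pos in range(len(df)):
    x, m = df.X.iloc[pos], df.rolling_mean.iloc[pos]
    nxt = df.X.iloc[pos + 1:pos + 1 + k]
    crossed = (x > m and (nxt < m).any()) or (x < m and (nxt > m).any())
    expected.append(int(crossed))

assert df['new_column'].tolist() == expected
```

Rows where rolling_mean is NaN (the first n - 1) and the final row (which has no future observations) evaluate to 0 in both versions, because comparisons against NaN are False.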