Search code examples
pythonpandasdataframepattern-matchingpandas-rolling

filter for specific sequences involving multiple columns and surrounding rows


I have data that looks like this:

enter image description here

It's standard financial price data (open, high, low, close).

In addition, I run some calculations. 'major_check' occasionally returns 1 or 2 (which 'minor_check' will then also return). 'minor_check' also returns 1 or 2, but more frequently. the rest is filled with 0 or NaN.

I'd like to test for specific patterns:

  • Whenever there is a 2 in 'major_check', I want to see if I can find a 21212 pattern in 'minor_check', with 21 preceding the central 2 and 12 following it.
  • If there is a 1 in 'major_check', I'd like to find a 12121 pattern in 'minor_check'

I highlighted a 21212 pattern in the screenshot to give a better idea on what I am looking for.

Once the 21212 or 12121 patterns are found, I'll check if specific rules applied on open/high/low/close (corresponding to the 5 rows constituting the pattern) are met or not.

Of course, one could naively iterate through the dataframe but this doesn't sound like the Pythonic way to do it.

I didn't manage to find a good way to do this, since a 21212 pattern can have some 0s inside it


Solution

  • As this answer by Timeless looked surprisingly complex, here is a quite simpler one.

    Method:

    • Temporarily remove rows that are empty of test results (effectively skipping NaN and None),
    • Search for patterns
      • either with numpy.where and pandas.shift to check for patterns row-wise (faster),
      • or preferrably with pandas.rolling -probably faster, more compact, but still readable.
    • Finally report the findings into the original df.

    You haven't specified how to flag the findings. Here they get marked as a True in two new columns, one for each pattern, appended to the original dataframe, for whatever use you would plan for them. They are called "hit1" and "hit2".

    Input data
    No text input data in your post, so until then I came up with my own. It is designed to produce one hit for each pattern:

    • "minor test" 12121 + "major test" 1 at index 14
    • "minor test" 21212 + "major test" 2 at index 12
    import pandas as pd
    import numpy as np
    
    # Start dataframe
    df = pd.DataFrame({'minor': [None,2,   1,   None,None,2,   None,2,   None,None,None,
                                 1,   2,   None,1,   None,None,2,   None,1,   2,   None,None],
                       'major': [None,None,None,None,None,2,   None,None,None,None,None,
                                 None,2,   None,1,   None,None,None,   None,None,None,None,None]})
    df
    
        minor  major
    0     NaN    NaN
    1     2.0    NaN
    2     1.0    NaN
    3     NaN    NaN
    4     NaN    NaN
    5     2.0    2.0
    6     NaN    NaN
    7     2.0    NaN
    8     NaN    NaN
    9     NaN    NaN
    10    NaN    NaN
    11    1.0    NaN
    12    2.0    2.0
    13    NaN    NaN
    14    1.0    1.0
    15    NaN    NaN
    16    NaN    NaN
    17    2.0    NaN
    18    NaN    NaN
    19    1.0    NaN
    20    2.0    NaN
    21    NaN    NaN
    22    NaN    NaN
    

    Locate hits
    Skip rows without test results

    # Remove rows without minor test result
    df1 = df.dropna(axis=0,subset='minor').copy()
    # No reset_index because we'll use it to report back to df.
    
    # Patterns of 'minor test' 
    minor_pat1 = [1,2,1,2,1]
    minor_pat2 = [2,1,2,1,2]
    

    Pattern search:

    • alternative 1: .shift()and np.where
    # Deploy shift columns, looking 2 values backwards and 2 forward
    for i in range(-2,3):              # i in [-2,-1, 0, 1, 2]
        df1[i] = df1['minor'].shift(i) # create a column named i
    
    # Test for both patterns and major test value
    df1['hit1'] = np.where(df1['major']==1, # case 12121
                           df1[list(range(-2,3))].eq(minor_pat1).all(axis=1),
                           False)
    df1['hit2'] = np.where(df1['major']==2, # case 21212
                           df1[list(range(-2,3))].eq(minor_pat2).all(axis=1),
                           False)
    df1 # Temporary dataframe with findings located
    
        minor  major   -2   -1    0    1    2   hit1   hit2
    1     2.0    NaN  2.0  1.0  2.0  NaN  NaN  False  False
    2     1.0    NaN  2.0  2.0  1.0  2.0  NaN  False  False
    5     2.0    2.0  1.0  2.0  2.0  1.0  2.0  False  False
    7     2.0    NaN  2.0  1.0  2.0  2.0  1.0  False  False
    11    1.0    NaN  1.0  2.0  1.0  2.0  2.0  False  False
    12    2.0    2.0  2.0  1.0  2.0  1.0  2.0  False   True
    14    1.0    1.0  1.0  2.0  1.0  2.0  1.0   True  False
    17    2.0    NaN  2.0  1.0  2.0  1.0  2.0  False  False
    19    1.0    NaN  NaN  2.0  1.0  2.0  1.0  False  False
    20    2.0    NaN  NaN  NaN  2.0  1.0  2.0  False  False
    
    • alternative 2: one-liner with rolling, preferred:

    .rolling() was designed for that purpose exactly.
    Just too bad they haven't implemented .rolling().eq() yet (list of window functions).
    This is why we must resort to apply .eq() from inside a lambda function.

    df1['hit1'] = (df1['major']==1) & (df1['minor']
                                       .rolling(window=5, center=True)
                                       .apply(lambda x : x.eq(minor_pat1).all()))
    
    df1['hit2'] = (df1['major']==2) & (df1['minor']
                                       .rolling(window=5, center=True)
                                       .apply(lambda x : x.eq(minor_pat2).all()))
    
        minor  major   hit1   hit2
    1     2.0    NaN  False  False
    2     1.0    NaN  False  False
    5     2.0    2.0  False  False
    7     2.0    NaN  False  False
    11    1.0    NaN  False  False
    12    2.0    2.0  False   True
    14    1.0    1.0   True  False
    17    2.0    NaN  False  False
    19    1.0    NaN  False  False
    20    2.0    NaN  False  False
    

    Finally report back to original df

    df.loc[df1.index,'hit1'] = df1['hit1']
    df.loc[df1.index,'hit2'] = df1['hit2']
    df
    
        minor  major   hit1   hit2
    0     NaN    NaN    NaN    NaN
    1     2.0    NaN  False  False
    2     1.0    NaN  False  False
    3     NaN    NaN    NaN    NaN
    4     NaN    NaN    NaN    NaN
    5     2.0    2.0  False  False
    6     NaN    NaN    NaN    NaN
    7     2.0    NaN  False  False
    8     NaN    NaN    NaN    NaN
    9     NaN    NaN    NaN    NaN
    10    NaN    NaN    NaN    NaN
    11    1.0    NaN  False  False
    12    2.0    2.0  False   True
    13    NaN    NaN    NaN    NaN
    14    1.0    1.0   True  False
    15    NaN    NaN    NaN    NaN
    16    NaN    NaN    NaN    NaN
    17    2.0    NaN  False  False
    18    NaN    NaN    NaN    NaN
    19    1.0    NaN  False  False
    20    2.0    NaN  False  False
    21    NaN    NaN    NaN    NaN
    22    NaN    NaN    NaN    NaN