python pandas series

Find a sequence of values within a pandas Series


I am looking for the best way to find a sequence of values within a longer pandas Series, where the sequence can vary in length. For example, I have the values [92.6, 92.7, 92.9] (the sequence could also have length 2 or 5) and would like to find every place where this exact sequence occurs within the longer Series:

s = pd.Series([92.6,92.7,92.9,24.2,24.3,25.1,24.9,25.1,24.9,97.6,94.5,1.0,92.6,92.7,92.9,97.9,96.8,96.4,92.8,92.8,93.1,89.5,89.6])

(The actual Series has a length of approximately 1000.)

In this example, the correct result is indices 0, 1, 2 and 12, 13, 14.


Solution

  • Using rolling to identify the last row of each stretch:

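    target = [92.6, 92.7, 92.9]
    
    # 1.0 where the window ending at a row equals target, 0.0 otherwise
    # (NaN for the first len(target)-1 rows, which have no full window)
    m = s.rolling(len(target)).apply(lambda x: x.eq(target).all())
    # index labels of the last row of each match
    out = m[m.eq(1)].index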
    

    Output: [2, 14]

    For all indices:

    out = [x for end in m[m.eq(1)].index for x in range(end-len(target)+1, end+1)]
    

    Output:

    [0, 1, 2, 12, 13, 14]
    

    Alternatively, using numpy's sliding_window_view, giving the starting indices:

    import numpy as np
    from numpy.lib.stride_tricks import sliding_window_view as swv
    
    # compare every window of len(target) values against target
    out = np.where((swv(s, len(target)) == target).all(axis=1))[0]
    

    Output: array([ 0, 12])

    For all indices:

    out2 = (np.linspace(out[:,None], out[:,None]+len(target)-1, len(target))
              .ravel('F').astype(int)
            )
    

    Output: array([ 0, 1, 2, 12, 13, 14])
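
Both approaches above compare floats with exact equality, which works for the literal values in the question but may miss matches if the Series holds computed values. The following is a minimal sketch of the sliding-window idea wrapped in a reusable function with a tolerance, assuming NumPy 1.20 or later (for sliding_window_view). The helper name find_sequence and its atol parameter are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def find_sequence(series, target, atol=1e-9):
    """Return the sorted positions of every element that is part of a match.

    Hypothetical helper: `atol` is an assumed absolute tolerance for the
    float comparison; pass atol=0.0 for exact equality.
    """
    values = series.to_numpy(dtype=float)
    pattern = np.asarray(target, dtype=float)
    n = len(pattern)
    if n == 0 or n > len(values):
        return np.array([], dtype=int)
    # One row per window of n consecutive values (a view, no copy).
    windows = sliding_window_view(values, n)
    # Start positions of the windows whose elements all match the pattern.
    matches = np.isclose(windows, pattern, rtol=0.0, atol=atol).all(axis=1)
    starts = np.flatnonzero(matches)
    # Expand each start into n consecutive positions; np.unique removes
    # duplicates produced by overlapping matches.
    return np.unique(starts[:, None] + np.arange(n))


s = pd.Series([92.6, 92.7, 92.9, 24.2, 24.3, 25.1, 24.9, 25.1, 24.9, 97.6,
               94.5, 1.0, 92.6, 92.7, 92.9, 97.9, 96.8, 96.4, 92.8, 92.8,
               93.1, 89.5, 89.6])

positions = find_sequence(s, [92.6, 92.7, 92.9])
print(positions.tolist())  # [0, 1, 2, 12, 13, 14]
```

The function returns positions rather than index labels. If the Series does not use the default integer index, s.index[positions] maps them back to labels.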