Search code examples
pandasdataframenumpytime-series

How to remove rows so that the values in a column match a sequence


I'm looking for a more efficient method to deal with the following problem. I have a Dataframe with a column filled with values that randomly range from 1 to 4, I need to remove all the rows of the data frame that do not follow the sequence (1-2-3-4-1-2-3-...). This is what I have:

              A     B
12/2/2022    0.02   2
14/2/2022    0.01   1
15/2/2022    0.04   4
16/2/2022   -0.02   3
18/2/2022   -0.01   2
20/2/2022    0.04   1
21/2/2022    0.02   3
22/2/2022   -0.01   1
24/2/2022    0.04   4
26/2/2022   -0.02   2
27/2/2022    0.01   3
28/2/2022    0.04   1
01/3/2022   -0.02   3
03/3/2022   -0.01   2
05/3/2022    0.04   1
06/3/2022    0.02   3
08/3/2022   -0.01   1
10/3/2022    0.04   4
12/3/2022   -0.02   2
13/3/2022    0.01   3
15/3/2022    0.04   1
      ...

This is what I need:

              A     B
14/2/2022    0.01   1
18/2/2022   -0.01   2
21/2/2022    0.02   3
24/2/2022    0.04   4
28/2/2022    0.04   1
03/3/2022   -0.01   2
06/3/2022    0.02   3
10/3/2022    0.04   4
15/3/2022    0.04   1
        ...

Since the data frame is quite big I need some sort of NumPy-based operation to accomplish this, the more efficient the better. My solution is very ugly and inefficient, basically, I made 4 loops like the following to check for every part of the sequence (4-1,1-2,2-3,3-4):

df_len = len(df)
df_len2 = 0
while df_len != df_len2:
    df_len = len(df)
    df.loc[(df.B.shift(1) == 4) & (df.B != 1), 'B'] = 0
    df = df[df['B'] != 0]
    df_len2 = len(df)

Solution

  • A simple improvement to speed this up is to not touch the dataframe within the loop, but just iterate over the values of B to construct a Boolean index, like this:

    is_in_sequence = []
    next_target = 1
    
    for b in df.B:
        if b == next_target:
            is_in_sequence.append(True)
            next_target = next_target % 4 + 1
        else:
            is_in_sequence.append(False)
            
    print(df[is_in_sequence])
    
                 A     B
    14/2/2022    0.01   1
    18/2/2022   -0.01   2
    21/2/2022    0.02   3
    24/2/2022    0.04   4
    28/2/2022    0.04   1
    03/3/2022   -0.01   2
    06/3/2022    0.02   3
    10/3/2022    0.04   4
    15/3/2022    0.04   1