Pandas keep every nth row with special rule


For example, I want to keep every 3rd row, but I must also keep every number divisible by 3 (or matching some similar special rule). A number divisible by 3 restarts the count: I start counting to 3 again from that row, unless I see another value divisible by 3 first. Example given below:

import pandas as pd
df = pd.DataFrame.from_dict({'x': [0, 1, 2, 3, 4, 5, 7, 8, 9, 11, 12, 13, 14, 17, 20, 23]})
filtered = pd.DataFrame.from_dict({'x': [0, 3, 7, 9, 12, 17]})  # this is the desired dataframe
print(df, '\n\n--------------\n\n', filtered)

     x
0    0
1    1
2    2
3    3
4    4
5    5
6    7
7    8
8    9
9   11
10  12
11  13
12  14
13  17
14  20
15  23 

--------------

     x
0   0
1   3
2   7
3   9
4  12
5  17

Solution

  • You can use groupby.cumcount with a custom grouper (the cumulative sum of the "divisible by 3" mask):

    # identify starts of groups
    m1 = df['x'].mod(3).eq(0)
    
    # for each group, get every third row
    m2 = (df.groupby(m1.cumsum())
            .cumcount().mod(3).eq(0)
          )
    
    out = df[m2]
    

    Output:

         x
    0    0
    3    3
    6    7
    8    9
    10  12
    13  17
    

    Intermediates:

         x     m1  m1.cumsum()  cumcount     m2
    0    0   True            1         0   True
    1    1  False            1         1  False
    2    2  False            1         2  False
    3    3   True            2         0   True
    4    4  False            2         1  False
    5    5  False            2         2  False
    6    7  False            2         3   True
    7    8  False            2         4  False
    8    9   True            3         0   True
    9   11  False            3         1  False
    10  12   True            4         0   True
    11  13  False            4         1  False
    12  14  False            4         2  False
    13  17  False            4         3   True
    14  20  False            4         4  False
    15  23  False            4         5  False
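
The same idea can be wrapped in a small helper so that the special rule and the step are parameters. This is only a sketch: the function name `keep_every_nth` and its arguments `restart` and `n` are made up for illustration and are not part of pandas. It assumes `restart` is a boolean Series aligned with `df`. Note that rows before the first restart form their own group, so the very first row is always kept.

```python
import pandas as pd


def keep_every_nth(df, restart, n=3):
    """Keep every n-th row, restarting the count wherever `restart` is True.

    `restart` is a boolean Series aligned with `df` (hypothetical helper,
    not a pandas API). Rows where it is True are always kept.
    """
    # each True in `restart` opens a new group
    group_id = restart.cumsum()
    # position of each row inside its group: 0, 1, 2, ...
    pos = df.groupby(group_id).cumcount()
    # keep positions 0, n, 2n, ... of every group
    return df[pos.mod(n).eq(0)]


df = pd.DataFrame({'x': [0, 1, 2, 3, 4, 5, 7, 8, 9, 11, 12, 13, 14, 17, 20, 23]})
out = keep_every_nth(df, df['x'].mod(3).eq(0), n=3)

# drop the original index so the result matches the desired frame in the question
result = out.reset_index(drop=True)
print(result)
```

With the sample data this keeps 0, 3, 7, 9, 12 and 17, matching the intermediates table above. Any other rule can be swapped in by passing a different boolean mask as `restart`.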