Search code examples
pandasdataframesubsetmask

subset df by masking between specific rows


I'm trying to subset a pandas df by removing rows that fall between specific values. The problem is these values can be at different rows so I can't select fixed rows.

Specifically, I want to remove rows that fall between ABC xxx and the integer 5. These values could fall anywhere in the df and be of unequal length.

Note: The string ABC will be followed by different values.

I thought about returning all the indexes that contain these two values.

But mask could work better if I could return all rows between these two values?

df = pd.DataFrame({
    'Val' : ['None','ABC','None',1,2,3,4,5,'X',1,2,'ABC',1,4,5,'Y',1,2],                                                   
    })

mask = (df['Val'].str.contains(r'ABC(?!$)')) & (df['Val'] == 5)   

Intended Output:

     Val
0   None
8      X
9      1
10     2
15     Y
16     1
17     2

Solution

  • If ABC is always before 5 and always pairs (ABC, 5) get indices of values with np.where, zip and get index values between - last filter by isin with invert mask by ~:

    #2 values of ABC, 5 in data
    df = pd.DataFrame({
       'Val' : ['None','ABC','None',1,2,3,4,5,'None','None','None',
                'None','ABC','None',1,2,3,4,5,'None','None','None']
        })
    
    m1 = np.where(df['Val'].str.contains(r'ABC', na=False))[0]
    m2 = np.where(df['Val'] == 5)[0]
    
    print (m1)
    [ 1 12]
    
    print (m2)
    [ 7 18]
    
    idx = [x for y, z in zip(m1, m2) for x in range(y, z + 1)]
    print (df[~df.index.isin(idx)])
         Val
    0   None
    8      X
    9      1
    10     2
    11  None
    19     X
    20     1
    21     2