Search code examples
pythonpandasdataframedata-cleaning

clean dataframe by columns condition


I would like to keep only the rows of a Dataframe with the following condition: the intervals(included) in which the beginning condition is col1 = 0, col2 = 1 and the interval end col1 = 0, col2 = 2.

Sample data

import pandas as pd

pd.DataFrame({'id':['id1','id1','id1','id1','id1','id1','id1','id1','id1','id1','id1','id2','id2','id2','id2','id2']
                  ,'col1':[0,1,1,0,1,0,0,1,1,0,0,1,0,0,1,1],'col2':[1,2,2,1,2,2,1,2,2,2,1,2,2,1,2,2]})

This would look like this:

    col1 col2 id
0   0   1   id1
1   1   2   id1
2   1   2   id1
3   0   1   id1
4   1   2   id1
5   0   2   id1
6   0   1   id1
7   1   2   id1
8   1   2   id1
9   0   2   id1
10  0   1   id1
11  1   2   id2
12  0   2   id2
13  0   1   id2
14  1   2   id2
15  1   2   id2

Output Sample

We can realise that there are only "blocks" or intervals with 0-1,0-2 in col1,col2.

   col1 col2 id
3   0   1   id1
4   1   2   id1
5   0   2   id1
6   0   1   id1
7   1   2   id1
8   1   2   id1
9   0   2   id1
10  0   1   id1
11  1   2   id2
12  0   2   id2

As a result rows 0,1,2,13,14,15 were erased because they weren't a in a 0-1 , 0-2 interval.


Solution

  • By using the new para group (drop it by using df.drop('group',1))


    Setting up

    df['group']=(df.col1==0)&(df.col2==1)
    df['group']=df['group'].cumsum()
    

    Option1

    mask=df.groupby('group').apply(lambda x : sum((x.col1==0)&(x.col2==2)))
    df.loc[df.group.isin(mask[mask.eq(1)].index)]
    
    
    Out[363]: 
        col1  col2   id  group
    3      0     1  id1      2
    4      1     2  id1      2
    5      0     2  id1      2
    6      0     1  id1      3
    7      1     2  id1      3
    8      1     2  id1      3
    9      0     2  id1      3
    10     0     1  id1      4
    11     1     2  id2      4
    12     0     2  id2      4
    

    Option2 case mention by
    @Bharathshetty

    mask=df.groupby('group').last().loc[lambda x : (x.col1==0)&(x.col2==2),].index
    df.loc[df.group.isin(mask)]
    
    
    Out[379]: 
        col1  col2   id  group
    3      0     1  id1      2
    4      1     2  id1      2
    5      0     2  id1      2
    6      0     1  id1      3
    7      1     2  id1      3
    8      1     2  id1      3
    9      0     2  id1      3
    10     0     1  id1      4
    11     1     2  id2      4
    12     0     2  id2      4