Search code examples
pythonpython-3.xpandasdataframesklearn-pandas

Multiple slicing rows in pandas on a cell condition


Msgtype Date ConvID   message
enquire 12/1 689  I want your car
reply   12/3 689  it is available
reply   12/4 689  rent please?
reply   12/6 689  $200
accept  12/8 689  please pay through CC
reply   12/8 689  thank you, what about fuel?
reply   12/8 689  you have to take care
enquire 12/3 690  Looking for car
reply   12/4 690  available
accept  12/5 690  paid
reply   12/6 690  thank you

I want to group this data by ConvID and sort it by date. I want the rows till "Msgtype" = accept for that particular ConvID. Aim to analyze message data till booking request is accepted for particular ConvID. so for ConvID = 689, I want rows till "Msgtype" = accept. Rest of the rows after "accept" are not required.

Eg: These two are not required for ConvID = 689

    Msgtype Date ConvID   message
    reply   12/8 689  thank you, what about fuel?
    reply   12/8 689  you have to take care

Similarly this row is not required for ConvID = 690

Msgtype Date ConvID   message
 reply   12/6 690  thank you

Solution

  • I think you can use:

    mask1 = (df.Msgtype == 'accept')
    mask = mask1.groupby([df.ConvID]).apply(lambda x: x.shift().fillna(False).cumsum()) == 0
    
    print (df[mask].sort_values(['ConvID','Date']))
       Msgtype  Date  ConvID                message
    0  enquire  12/1     689        I want your car
    1    reply  12/3     689        it is available
    2    reply  12/4     689           rent please?
    3    reply  12/6     689                   $200
    4   accept  12/8     689  please pay through CC
    7  enquire  12/3     690        Looking for car
    8    reply  12/4     690              available
    9   accept  12/5     690                   paid
    

    Explanations:

    #mask where is 'accept'
    mask1 = (df.Msgtype == 'accept')
    print (mask1)
    0     False
    1     False
    2     False
    3     False
    4      True
    5     False
    6     False
    7     False
    8     False
    9      True
    10    False
    Name: Msgtype, dtype: bool
    
    #per group shift, replace NaN by False and cumulative sum
    print (mask1.groupby([df.ConvID]).apply(lambda x: x.shift().fillna(False).cumsum()))
    0     0
    1     0
    2     0
    3     0
    4     0
    5     1
    6     1
    7     0
    8     0
    9     0
    10    1
    Name: Msgtype, dtype: int32
    
    #where output of groupby is 0 
    mask = mask1.groupby([df.ConvID]).apply(lambda x: x.shift().fillna(False).cumsum()) == 0
    print (mask)
    0      True
    1      True
    2      True
    3      True
    4      True
    5     False
    6     False
    7      True
    8      True
    9      True
    10    False
    Name: Msgtype, dtype: bool
    
    #boolean indexing and sorting
    print (df[mask].sort_values(['ConvID','Date']))
       Msgtype  Date  ConvID                message
    0  enquire  12/1     689        I want your car
    1    reply  12/3     689        it is available
    2    reply  12/4     689           rent please?
    3    reply  12/6     689                   $200
    4   accept  12/8     689  please pay through CC
    7  enquire  12/3     690        Looking for car
    8    reply  12/4     690              available
    9   accept  12/5     690                   paid