Tags: python, pandas, dataframe, group-by

Pandas groupby with a condition, keeping every group


I have the following pandas dataframe:

import pandas as pd
df = pd.DataFrame({
    "review_num": [2,2,2,1,1,1,1,1,3],
    "review": ["The second review","The second review","The second review",
               "This is the first review","This is the first review",
               "This is the first review","This is the first review",
               "This is the first review",'No'],
    "token_num":[1,2,3,1,2,3,4,5,1],
    "token":["The","second","review","This","is","the","first","review","No"],
    "score":[0.3,-0.6,0.4,0.5,0.8,-0.7,0.6,0.4,0.3]
})

   review_num                    review  token_num   token  score
0           2         The second review          1     The    0.3
1           2         The second review          2  second   -0.6
2           2         The second review          3  review    0.4
3           1  This is the first review          1    This    0.5
4           1  This is the first review          2      is    0.8
5           1  This is the first review          3     the   -0.7
6           1  This is the first review          4   first    0.6
7           1  This is the first review          5  review    0.4
8           3                        No          1      No    0.3

I need to select one row per review, as follows:

  1. If the review contains "t" or "r": get the row with the max score, considering only the rows whose token contains "t" or "r"
  2. If the review doesn't contain "t" or "r": get just one row of the review
  3. Keep the reviews in the same order as in the original table

The code below satisfies rules 1 and 3, but I don't see how to satisfy rule 2 without breaking rule 3.

prelist=df['token'].str.contains('|'.join(['t','r']))

token_max_score = df[prelist].groupby('review_num', sort=False)['score'].idxmax()

Current result:

review_num
2    2
1    6

Expected result:

review_num
2    2
1    6
3    8

Solution

  • Use:

    # rows with t/r in token
    m = df['token'].str.contains('r|t')
    # identify reviews with no match
    m2 = (~m).groupby(df['review_num']).transform('all')
    
    # for each group get idxmax
    df[m|m2].groupby('review_num', sort=False)['score'].idxmax()
    

    Output:

    review_num
    2    2
    1    6
    3    8
    Name: score, dtype: int64
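
To see why this keeps review 3, it can help to look at the two masks separately. The sketch below rebuilds only the columns it needs from the question's data. The names `idx` and `rows` are illustrative additions, and `df.loc[idx]` is just one possible way to get the full rows back from the index labels. Note that `str.contains` is case-sensitive by default, so "The" and "This" do not match.

```python
import pandas as pd

# Subset of the question's data (the "review" and "token_num" columns are not needed here)
df = pd.DataFrame({
    "review_num": [2, 2, 2, 1, 1, 1, 1, 1, 3],
    "token": ["The", "second", "review", "This", "is", "the", "first", "review", "No"],
    "score": [0.3, -0.6, 0.4, 0.5, 0.8, -0.7, 0.6, 0.4, 0.3],
})

# True where the token contains a lowercase "r" or "t"
m = df["token"].str.contains("r|t")

# True for every row of a review in which no token matched at all
m2 = (~m).groupby(df["review_num"]).transform("all")

# Keep matching rows plus all rows of no-match reviews, then take idxmax per review
idx = df[m | m2].groupby("review_num", sort=False)["score"].idxmax()

# Look up the full rows from the selected index labels
rows = df.loc[idx]
print(rows)
```

Because review 3 has no matching token, `m2` keeps its rows in the filtered frame, so the group is still present when `groupby` runs.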
    

    Previous answer

    You can use a custom groupby.apply:

    (df.groupby('review_num', sort=False)
       .apply(lambda g: g['score'].idxmax()
              if set(g['review'].iloc[0]).intersection(['t', 'r'])
              else g.sample(n=1).index[0])
    )
    

    Example output:

    review_num
    2    2
    1    4
    3    8
    dtype: int64
    

    Logic:

    • we group by "review_num", keeping the original order of the groups (sort=False)
    • for each group, we convert the "review" string to a set of characters and intersect it with {"t", "r"}; if the intersection is not empty, we pick the idxmax of "score" over all rows of the group
    • otherwise, we pick a random row
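
As a variation on the per-group logic above, here is a minimal sketch that differs from the answer in two ways. It applies the t/r test to the tokens rather than to the review text, and it picks the first row of a no-match group instead of a random one, so the result is reproducible. The helper name `pick_row` is hypothetical, and the data is a subset of the question's columns.

```python
import pandas as pd

# Subset of the question's data
df = pd.DataFrame({
    "review_num": [2, 2, 2, 1, 1, 1, 1, 1, 3],
    "token": ["The", "second", "review", "This", "is", "the", "first", "review", "No"],
    "score": [0.3, -0.6, 0.4, 0.5, 0.8, -0.7, 0.6, 0.4, 0.3],
})

def pick_row(g):
    """Return the index label of the row to keep for one review (hypothetical helper)."""
    # Rows of this review whose token contains a lowercase "r" or "t"
    hits = g[g["token"].str.contains("r|t")]
    if not hits.empty:
        # Best score among the matching tokens only
        return hits["score"].idxmax()
    # No matching token: fall back to the first row of the group
    return g.index[0]

# sort=False keeps the groups in order of first appearance
picked = pd.Series(
    {key: pick_row(g) for key, g in df.groupby("review_num", sort=False)}
)
print(picked)
```

On this data it selects the same rows as the mask-based solution (index labels 2, 6 and 8), at the cost of a Python-level loop over the groups.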