I have the following pandas dataframe:
import pandas as pd

df = pd.DataFrame({
    "review_num": [2, 2, 2, 1, 1, 1, 1, 1, 3],
    "review": ["The second review", "The second review", "The second review",
               "This is the first review", "This is the first review",
               "This is the first review", "This is the first review",
               "This is the first review", "No"],
    "token_num": [1, 2, 3, 1, 2, 3, 4, 5, 1],
    "token": ["The", "second", "review", "This", "is", "the", "first", "review", "No"],
    "score": [0.3, -0.6, 0.4, 0.5, 0.8, -0.7, 0.6, 0.4, 0.3]
})
   review_num                    review  token_num   token  score
0           2         The second review          1     The    0.3
1           2         The second review          2  second   -0.6
2           2         The second review          3  review    0.4
3           1  This is the first review          1    This    0.5
4           1  This is the first review          2      is    0.8
5           1  This is the first review          3     the   -0.7
6           1  This is the first review          4   first    0.6
7           1  This is the first review          5  review    0.4
8           3                        No          1      No    0.3
I need to get the lines as below:
With this code, I satisfy rules 1 and 3, but I don't see how to satisfy the second rule without breaking the third.
prelist=df['token'].str.contains('|'.join(['t','r']))
token_max_score = df[prelist].groupby('review_num', sort=False)['score'].idxmax()
Current result:
review_num
2 2
1 6
Expected result:
review_num
2 2
1 6
3 8
Use:
# rows with t/r in token
m = df['token'].str.contains('r|t')
# identify reviews with no match
m2 = (~m).groupby(df['review_num']).transform('all')
# for each group get idxmax
df[m|m2].groupby('review_num', sort=False)['score'].idxmax()
Output:
review_num
2 2
1 6
3 8
Name: score, dtype: int64
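To see why this works, it can help to print the two masks next to the tokens. The following is a minimal sketch, not part of the answer itself: it rebuilds the question's data without the `review` column (dropped here only for brevity), and the names `debug` and `picked` are illustrative. It then uses the returned index labels to pull the selected rows with `df.loc`.

```python
import pandas as pd

df = pd.DataFrame({
    "review_num": [2, 2, 2, 1, 1, 1, 1, 1, 3],
    "token_num": [1, 2, 3, 1, 2, 3, 4, 5, 1],
    "token": ["The", "second", "review", "This", "is", "the", "first", "review", "No"],
    "score": [0.3, -0.6, 0.4, 0.5, 0.8, -0.7, 0.6, 0.4, 0.3],
})

# True where the token contains a lowercase 't' or 'r' (the match is case-sensitive)
m = df["token"].str.contains("r|t")
# True on every row of a review in which no token matched
m2 = (~m).groupby(df["review_num"]).transform("all")

# Inspect both masks next to the tokens (illustrative helper frame)
debug = df.assign(m=m, m2=m2)
print(debug[["review_num", "token", "m", "m2"]])

# Index label of the best-scoring kept row per review, then the rows themselves
picked = df[m | m2].groupby("review_num", sort=False)["score"].idxmax()
print(df.loc[picked])
```

Note that for a review with no matching token, all of its rows are kept, so this returns the highest-scoring token of that review rather than a random one.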
You can use a custom groupby.apply:
(df.groupby('review_num', sort=False)
   .apply(lambda g: g['score'].idxmax()
          if set(g['review'].iloc[0]).intersection(['t', 'r'])
          else g.sample(n=1).index[0])
)
Example output:
review_num
2 2
1 4
3 8
dtype: int64
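Logic: for each group, convert the review text to a set of characters and compare it to t/r. If the intersection is not empty, pick the idxmax; otherwise pick a random row of the group.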
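The same logic can be written as a named function, which makes the two branches easier to read. This is only a sketch under stated assumptions: `pick_index` and its `seed` argument are hypothetical names introduced here, the frame is a reduced copy of the question's data, and `random_state` is passed solely to make the random branch reproducible. The check runs on the characters of the whole review text, not on the individual tokens.

```python
import pandas as pd

df = pd.DataFrame({
    "review_num": [2, 2, 2, 1, 1, 1, 1, 1, 3],
    "review": ["The second review"] * 3
              + ["This is the first review"] * 5
              + ["No"],
    "score": [0.3, -0.6, 0.4, 0.5, 0.8, -0.7, 0.6, 0.4, 0.3],
})

def pick_index(g, seed=0):
    """Hypothetical helper: return one index label for the group `g`."""
    # Characters that occur anywhere in the review text of this group
    chars = set(g["review"].iloc[0])
    if chars & {"t", "r"}:
        # The review contains 't' or 'r': take the row with the highest score
        return g["score"].idxmax()
    # No 't' or 'r' in the review: take one row at random (seeded here)
    return g.sample(n=1, random_state=seed).index[0]

# Iterate over the groups in order of first appearance (sort=False)
result = {num: pick_index(g) for num, g in df.groupby("review_num", sort=False)}
print(result)
```

Iterating over the groups instead of calling `apply` is a choice made for this sketch; it keeps the example independent of how a given pandas version treats the grouping column inside `apply`.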