I have an Amazon reviews dataset that includes reviewer_id and review_date. For each reviewer, I need to rank their reviews by arrival date.
The example below shows the problem I am facing. Reviewer aaa posted 5 reviews, two of which arrived on the same date. Reviews that arrived on the same date should share the same rank (2 in this example). The next review, which arrived on 1/3/2000, should be ranked 4, not 3, because it is review #4 for reviewer aaa. All the code I have tried continues the ranking from the last rank number, which is not what I need.
The same applies to reviewer bbb: the two reviews posted on 9/10/2010 should both rank 1, and the next review, posted on 11/10/2010, should rank 3.
Thank you in advance!
reviewer_id | review_date | rank |
---|---|---|
aaa | 1/1/2000 | 1 |
aaa | 1/2/2000 | 2 |
aaa | 1/2/2000 | 2 |
aaa | 1/3/2000 | 4 |
aaa | 5/9/2002 | 5 |
bbb | 9/10/2010 | 1 |
bbb | 9/10/2010 | 1 |
bbb | 11/10/2010 | 3 |
bbb | 12/10/2010 | 4 |
Here is the code I tried, with no success:
df['rank']=df.groupby('reviewer_id').review_date.transform(lambda x : pd.factorize(x)[0]+1)
This code produces the following rank column, which is not what I need because the rank after a tie continues from the last rank number instead of skipping ahead:
reviewer_id | review_date | rank |
---|---|---|
aaa | 1/1/2000 | 1 |
aaa | 1/2/2000 | 2 |
aaa | 1/2/2000 | 2 |
aaa | 1/3/2000 | 3 |
aaa | 5/9/2002 | 4 |
You can use groupby.rank with the method='min' parameter:
df['rank'] = (pd.to_datetime(df['review_date'])
              .groupby(df['reviewer_id']).rank(method='min')
              .convert_dtypes()
             )
Output:
reviewer_id review_date rank
0 aaa 1/1/2000 1
1 aaa 1/2/2000 2
2 aaa 1/2/2000 2
3 aaa 1/3/2000 4
4 aaa 5/9/2002 5
5 bbb 9/10/2010 1
6 bbb 9/10/2010 1
7 bbb 11/10/2010 3
8 bbb 12/10/2010 4
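To make the difference from the factorize attempt concrete, here is a minimal, self-contained sketch that rebuilds the example data and computes both rankings side by side. It assumes the dates are in month/day/year order (the question does not say; the resulting order is the same either way for this data), and the column names rank_min and rank_dense are only illustrative. method='min' gives tied rows the lowest position in the tie and then skips ahead, while method='dense' reproduces the consecutive numbering that factorize gave on date-sorted data.

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame({
    "reviewer_id": ["aaa"] * 5 + ["bbb"] * 4,
    "review_date": [
        "1/1/2000", "1/2/2000", "1/2/2000", "1/3/2000", "5/9/2002",
        "9/10/2010", "9/10/2010", "11/10/2010", "12/10/2010",
    ],
})

# Assumption: dates are month/day/year. Parsing to datetime matters because
# ranking the raw strings would compare them lexicographically.
dates = pd.to_datetime(df["review_date"], format="%m/%d/%Y")
grouped = dates.groupby(df["reviewer_id"])

# 'min': ties share the lowest rank, and the next rank skips ahead (1, 2, 2, 4, 5)
df["rank_min"] = grouped.rank(method="min").astype(int)

# 'dense': ties share a rank, and the next rank is consecutive (1, 2, 2, 3, 4)
df["rank_dense"] = grouped.rank(method="dense").astype(int)

print(df)
```

The astype(int) call is safe here only because there are no missing dates; with missing values, rank returns NaN for those rows, and convert_dtypes() as in the answer above is the more forgiving choice.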