I have a data frame containing the customers ratings of the restaurants they went to and few other attributes.
data = {'rating_id': ['1', '2','3','4','5','6','7'],
'user_id': ['56', '13','56','99','99','13','12'],
'restaurant_id': ['xxx', 'xxx','yyy','yyy','xxx','zzz','zzz'],
'star_rating': ['2.3', '3.7','1.2','5.0','1.0','3.2','1.0'],
'rating_year': ['2012','2012','2020','2001','2020','2015','2000'],
'first_year': ['2012', '2012','2001','2001','2012','2000','2000'],
'last_year': ['2020', '2020','2020','2020','2020','2015','2015'],
}
df = pd.DataFrame (data, columns = ['rating_id','user_id','restaurant_id','star_rating','rating_year','first_year','last_year'])
df.head()
df['star_rating'] = df['star_rating'].astype(float)
# calculate the average of the stars of the first year
ratings_mean_firstYear= df.groupby(['restaurant_id','first_year']).agg({'star_rating':[np.mean]})
ratings_mean_firstYear.columns = ['avg_firstYear']
ratings_mean_firstYear.reset_index()
# calculate the average of the stars of the last year
ratings_mean_lastYear= df.groupby(['restaurant_id','last_year']).agg({'star_rating':[np.mean]})
ratings_mean_lastYear.columns = ['avg_lastYear']
ratings_mean_lastYear.reset_index()
# merge the means into a single table
ratings_average = ratings_mean_firstYear.merge(
ratings_mean_lastYear.groupby('restaurant_id')['avg_lastYear'].max()
, on='restaurant_id'
)
ratings_average.head(20)
My problem is that the averages of the first and last years are the exact same and that makes no sens, i don't really know what i did wrong in my thought process here..i suspect something is going on with .agg
since it's the first time i use pandas lib.
Any suggestions?
Your data is provided in such a way that it has single rating per user/restaurant pair and you use it in both first and last year aggregation - so naturally it is equal for both years. I'd first filter the data using rating_year == first_year criteria and then apply groupby and agg. Then repeat same for the last year and then merge 2 results. In your example there is not a single review, whose data matches first or last year of any restaurant. So to show proper example would require more data. I assume that you have it in your larger dataframe. –
Here is an example, I added more lines and changed years to have more matches:
data = {'rating_id': ['1', '2','3','4','5','6','7','8','9'],
'user_id': ['56', '56','56','56', '99','99','99','99','99'],
'restaurant_id': ['xxx', 'xxx','yyy','yyy','xxx', 'xxx','yyy','yyy','xxx'],
'star_rating': ['2.3', '3.7','1.2','5.0','1.0','3.2','4.0','2.5','3.0'],
'rating_year': ['2012', '2020','2001','2020', '2012', '2020','2001','2020','2019'],
'first_year': ['2012', '2012','2001','2001','2012', '2012','2001','2001','2012'],
'last_year': ['2020', '2020','2020','2020','2020','2020','2020','2020','2020'],
}
df = pd.DataFrame (data, columns = ['rating_id','user_id','restaurant_id','star_rating','rating_year','first_year','last_year'])
df['star_rating'] = df['star_rating'].astype(float)
ratings_mean_firstYear = df[df.rating_year == df.first_year].groupby('restaurant_id').agg({'star_rating':'mean'})
ratings_mean_firstYear.columns = ['avg_firstYear']
ratings_mean_lastYear= df[df.rating_year == df.last_year].groupby('restaurant_id').agg({'star_rating':'mean'})
ratings_mean_lastYear.columns = ['avg_lastYear']
result:
ratings_mean_firstYear.merge(ratings_mean_lastYear, left_index=True, right_index=True)
avg_firstYear avg_lastYear
restaurant_id
xxx 1.65 3.45
yyy 2.60 3.75