I would like to ask you about how to find the cumulative average of a specific column value in a pandas dataframe. First, the data looks like this:
firm | date | reviewer | rate |
---|---|---|---|
A | 2021-01-01 | a | 5 |
A | 2021-01-01 | b | 1 |
A | 2021-01-01 | c | 2 |
A | 2021-01-02 | d | 3 |
A | 2021-01-02 | e | 4 |
A | 2021-01-03 | f | 3 |
A | 2021-01-04 | g | 5 |
B | 2021-01-01 | h | 5 |
B | 2021-01-01 | i | 2 |
B | 2021-01-02 | j | 3 |
B | 2021-01-02 | k | 4 |
B | 2021-01-03 | a | 3 |
B | 2021-01-04 | b | 5 |
What I want to find is to get the average rating of a specific company by date, and add a column to find the cumulative average rating including today's average rating.
I want to make it into a dataframe like the one below.
firm | date | reviewer | rate | cum_avg_rate |
---|---|---|---|---|
A | 2021-01-01 | a | 5 | 2.667 |
A | 2021-01-01 | b | 1 | 2.667 |
A | 2021-01-01 | c | 2 | 2.667 |
A | 2021-01-02 | d | 3 | 3 |
A | 2021-01-02 | e | 4 | 3 |
A | 2021-01-03 | f | 3 | 3 |
A | 2021-01-04 | g | 5 | 3.286 |
B | 2021-01-01 | h | 5 | 3.5 |
B | 2021-01-01 | i | 2 | 3.5 |
B | 2021-01-02 | j | 3 | 3.5 |
B | 2021-01-02 | k | 4 | 3.5 |
B | 2021-01-03 | a | 3 | 3.4 |
B | 2021-01-04 | b | 5 | 3.667 |
The method I have tried so far is to create a new data frame that calculates the average score and the number of reviews using the 'groupby' method with firm and date, and use this to create a cumulative average for each day.
The code is below.
firm_gp=avg_mean_rate.groupby(['firm','date'])['mean']
firm_gp_count=avg_mean_rate.groupby(['firm','date'])['count']
avg_mean_rate['new_avg_grade']=( (firm_gp * firm_gp_count).cumsum())/firm_gp_count.cumsum()
However, the problem is that the following error occurs in the process of calculating the cumulative average for each day.
TypeError: unsupported operand type(s) for *: 'SeriesGroupBy' and 'method'
As the second method, I tried the following method using numpy.
def w_cum_avg(avg_mean_rate,mean,count):
d=avg_mean_rate['mean']
w= avg_mean_rate['count']
return(d*w).cumsum() / w.cumsum()
avg_mean_rate.groupby(['firm','date']).apply(w_cum_avg,'mean','count')
But this doesn’t work well, what I expected.
I would appreciate it if you could teach me how to get results.
Thank you in advance.
We could compute the daily sum
and count
per firm
with groupby aggregate
then groupby cumsum
to get the daily cumulative total per firm
. Compute the mean by dividing and join
back to the DataFrame:
g = (
df.groupby(['firm', 'date'])['rate']
.agg(['sum', 'count'])
.groupby(level='firm').cumsum()
)
df = df.join(
g['sum'].div(g['count']).rename('cum_avg_rate'),
on=['firm', 'date'] # align index on columns
)
df
:
firm date reviewer rate cum_avg_rate
0 A 2021-01-01 a 5 2.666667
1 A 2021-01-01 b 1 2.666667
2 A 2021-01-01 c 2 2.666667
3 A 2021-01-02 d 3 3.000000
4 A 2021-01-02 e 4 3.000000
5 A 2021-01-03 f 3 3.000000
6 A 2021-01-04 g 5 3.285714
7 B 2021-01-01 h 5 3.500000
8 B 2021-01-01 i 2 3.500000
9 B 2021-01-02 j 3 3.500000
10 B 2021-01-02 k 4 3.500000
11 B 2021-01-03 a 3 3.400000
12 B 2021-01-04 b 5 3.666667
Setup:
import pandas as pd
df = pd.DataFrame({
'firm': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
'date': ['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02',
'2021-01-02', '2021-01-03', '2021-01-04', '2021-01-01',
'2021-01-01', '2021-01-02', '2021-01-02', '2021-01-03',
'2021-01-04'],
'reviewer': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'a',
'b'],
'rate': [5, 1, 2, 3, 4, 3, 5, 5, 2, 3, 4, 3, 5]
})