I have a DataFrame with the following structure:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "item": [1, 1, 1, 1, 2, 2, 2],
    "order": [1, 2, 2, 3, 1, 2, 3],
    "rating": [3, 2, 1, 5, 5, 2, 3]
})
I want to calculate the cumulative mean rating of each item, ordered by order. The problem is that order can contain duplicates. The cumulative mean should only include rows whose order is strictly smaller than the current row's, so the current order is not factored into the cumulative mean. All rows with the same order should get the same cumulative mean. The result should look like the following:
item | order | rating | cum_mean |
---|---|---|---|
1 | 1 | 3 | np.nan |
1 | 2 | 2 | 3 |
1 | 2 | 1 | 3 |
1 | 3 | 5 | 2 |
2 | 1 | 5 | np.nan |
2 | 2 | 2 | 5 |
2 | 3 | 3 | 3.5 |
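For example, for item 2 at order 3 the earlier ratings are 5 (order 1) and 2 (order 2), so cum_mean is (5 + 2) / 2 = 3.5.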
I currently have a workaround: I convert the DataFrame to a dictionary with the items as keys and the columns (order, rating) as lists, iterate through each item, and then convert it back to a DataFrame. For a single item it looks like this:
import statistics

prev_mean_list = []
i = 0
while i < len(_item["rating"]):
    cur_index = _item["order"][i]
    if i == 0:
        # first row of the item: no previous ratings
        prev_mean = np.nan
    elif cur_index == _item["order"][i - 1]:
        # same order as the previous row: reuse its cumulative mean
        prev_mean = prev_mean_list[-1]
    else:
        # mean of all ratings with a strictly smaller order
        prev_mean = statistics.mean(_item["rating"][0:i])
    prev_mean_list.append(prev_mean)
    i += 1
_item["prev_mean_order"] = prev_mean_list
The solution works, but it is quite slow because the full DataFrame consists of more than 2 million rows. I tried to build a pandas-native solution with groupby and expanding but failed.
Can you suggest a solution with a better runtime?
Create a custom function: use NumPy broadcasting to compare each order against all orders of the group, keep only the ratings of rows with a strictly smaller value via numpy.where, and get the mean with numpy.nanmean:
def f(x):
    a = x['order'].to_numpy()
    # broadcast: for each row, keep only ratings of rows with a strictly
    # smaller order, set the rest to NaN and take the row-wise mean
    x['cum_mean'] = np.nanmean(np.where(a < a[:, None], x['rating'], np.nan), axis=1)
    return x

df = df.groupby('item').apply(f)
print (df)
   item  order  rating  cum_mean
0     1      1       3       NaN
1     1      2       2       3.0
2     1      2       1       3.0
3     1      3       5       2.0
4     2      1       5       NaN
5     2      2       2       5.0
6     2      3       3       3.5
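Note that a < a[:, None] builds an n x n matrix for every item group, so memory grows quadratically with the group size. If some items have many rows, a groupwise aggregation avoids that blow-up. The following is only a minimal sketch of that idea, not part of the answer above: aggregate rating per (item, order), take the running sum and count shifted by one order, and merge the resulting mean back onto the rows.

import pandas as pd

df = pd.DataFrame({
    "item": [1, 1, 1, 1, 2, 2, 2],
    "order": [1, 2, 2, 3, 1, 2, 3],
    "rating": [3, 2, 1, 5, 5, 2, 3]
})

# total rating and row count per (item, order)
agg = df.groupby(["item", "order"])["rating"].agg(["sum", "count"])

# running totals per item, shifted by one order so the current order is excluded
cum = agg.groupby(level="item").cumsum().groupby(level="item").shift(1)

# mean of all ratings with a strictly smaller order
cum["cum_mean"] = cum["sum"] / cum["count"]

# attach the result to the original rows; duplicated orders get the same value
out = df.merge(cum[["cum_mean"]], left_on=["item", "order"], right_index=True, how="left")
print(out)

On the example data this reproduces the cum_mean column above; whether it beats the broadcasting version in practice depends on how large the per-item groups are.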