Cumulative mean based on condition with value duplicates


I have a DataFrame with the following structure:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "item": [1, 1, 1, 1, 2, 2, 2],
    "order": [1, 2, 2, 3, 1, 2, 3],
    "rating": [3, 2, 1, 5, 5, 2, 3]
})

I want to calculate the cumulative mean rating of each item, depending on order. The problem is that there can be duplicates of order. The cumulative mean should only be calculated for all rows where order is smaller than in the current row. So the current order should not be factored in the cumulative mean. For all rows where order is equal, the cumulative mean should be the same. The result should look like the following:

item  order  rating  cum_mean
   1      1       3    np.nan
   1      2       2         3
   1      2       1         3
   1      3       5         2
   2      1       5    np.nan
   2      2       2         5
   2      3       3       3.5

My current workaround converts the DataFrame to a dictionary with the items as keys and the columns (order, rating) as lists, iterates through each item, and then converts the result back to a DataFrame. For a single item it looks like this:

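import statistics

import numpy as np

# _item is the dict for one item, with "order" sorted ascending.
prev_mean_list = []
for i in range(len(_item["rating"])):
    cur_index = _item["order"][i]

    if i == 0:
        # No earlier rows, so there is nothing to average.
        prev_mean = np.nan
    elif cur_index == _item["order"][i-1]:
        # Same order as the previous row: reuse its mean.
        prev_mean = prev_mean_list[-1]
    else:
        # New order value: average all ratings seen so far.
        prev_mean = statistics.mean(_item["rating"][0:i])

    prev_mean_list.append(prev_mean)

_item["prev_mean_order"] = prev_mean_list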

This works but is quite slow, as the full DataFrame has more than 2 million rows. I tried to build a pandas-native solution with groupby and expanding, but failed.

Can you suggest a solution with a better runtime?


Solution

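  • Create a custom function that uses numpy broadcasting to compare every order value in a group with every other one, keeps only the ratings whose order is strictly less than the current row's via numpy.where (all others become NaN), and takes the row-wise mean with numpy.nanmean: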

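    def f(x):
        a = x['order'].to_numpy()
        # mask[i, j] is True where order[j] < order[i]
        mask = a < a[:, None]
        # rows with no smaller order are all-NaN, so nanmean returns NaN for them
        return x.assign(cum_mean=np.nanmean(np.where(mask, x['rating'], np.nan), axis=1))

    # group_keys=False keeps the original flat index (needed on pandas 2.0+)
    df = df.groupby('item', group_keys=False).apply(f)
    print(df)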
       item  order  rating  cum_mean
    0     1      1       3       NaN
    1     1      2       2       3.0
    2     1      2       1       3.0
    3     1      3       5       2.0
    4     2      1       5       NaN
    5     2      2       2       5.0
    6     2      3       3       3.5
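
The broadcasting approach above builds an n × n array per item, which may get heavy if a single item has many rows. As a possible alternative (a sketch of a different technique, not the method from the answer above), the same result can be obtained by aggregating the sum and count of ratings per (item, order), taking running totals within each item, and subtracting the current order's own contribution. The variable names agg, prior and result are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "item": [1, 1, 1, 1, 2, 2, 2],
    "order": [1, 2, 2, 3, 1, 2, 3],
    "rating": [3, 2, 1, 5, 5, 2, 3],
})

# Sum and count of ratings per (item, order); groupby sorts the keys,
# so the orders within each item come out ascending.
agg = df.groupby(["item", "order"])["rating"].agg(["sum", "count"])

# Running totals within each item, minus the current order's own
# contribution, give the totals over strictly smaller orders.
prior = agg.groupby(level="item").cumsum() - agg

# The first order of each item has 0 / 0, which pandas evaluates to NaN.
agg["cum_mean"] = prior["sum"] / prior["count"]

# Broadcast the per-(item, order) value back onto every original row.
result = df.join(agg["cum_mean"], on=["item", "order"])
print(result)
```

On the sample data this yields the same cum_mean column as shown above. Because it relies only on groupby aggregation, cumsum and a join, it avoids a Python-level function call per group, which is likely to matter on a DataFrame with millions of rows; it is worth timing both versions on the real data.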