Search code examples
pythonpandasgrouping

calculate expanding average per group excluding current


Objective: Calculate the expanding mean of the 'valuation' column for each 'slug' group, excluding the current row's value (and respecting ordering i.e. 'week').

Example dataset (output is the desired result).

idx week    slug    valuation   output
0   2   slouk   -4  12.00
1   3   slouk   7   4.00
2   4   slouk   8   5.00
3   3   kenun   10  14.00
4   1   kenun   11  
5   1   slouk   12  
6   2   kenun   17  11.00
7   4   kenun   21  12.67

I have tried (and failed) :

# chaining
td["output"] = (
    td.sort_values(by="week")
    .groupby("slug")["valuation"]
    .shift()
    .expanding()
    .mean()
    .reset_index(drop=True)
)

# apply
td["output"] = (
    td.sort_values(by="week")
    .groupby("slug")["valuation"]
    .apply(lambda x: x.shift().expanding().mean())
    .reset_index(drop=True)
)

Then I stumbled across this related topic and the .sort_index(level=1) did the trick.

So now, this is works :

td["output"] = (
    td.sort_values(by="week")
    .groupby("slug")["valuation"]
    .apply(lambda x: x.shift().expanding().mean())
    .sort_index(level=1)
    .reset_index(drop=True)
)

The "apply" version works, though is rather slow on a large dataset. When I try the .sort_index(level=1) with the "chaining" version, this still does not work.

Although it works, I still have several points I'd like to understand better :

  • Chaining Methods Issue: When I use method chaining like td.groupby('slug')['valuation'].shift().expanding().mean(), it seems to lose track of the grouping after a certain point. I'm puzzled about why this happens and how it differs from using apply.

  • Indexing and Assignment with apply: Using apply with a lambda function performs the operation correctly, but the resulting Series has a different order than my original DataFrame. When I try to assign it back using .reset_index(drop=True), things don't align properly. I'm seeking clarity on how this reassignment process works.

  • Is there any other, more efficient way to do what i am after?

My goal is to build a better understanding of these methods.


Solution

  • use transform instead apply. This should give you some speedup. And sort_index and reset_index are not needed to create columns.

    td['output'] = (
        td.sort_values(by="week")
        .groupby("slug")["valuation"]
        .transform(lambda x: x.shift().expanding().mean())
    )
    

    To answer your further question, you can see that td.groupby('slug')['valuation'].shift() is not a groupby object, it's just a series. So adding .expanding().mean() doesn't do any grouping operations. If you don't use lambda functions, you'll need to perform the groupby operation one more time with code like this td.groupby('slug')['valuation'].shift().groupby('slug').expanding().mean()