I've been looking to retrieve group-based expanding means from the following dataset:
df = pd.DataFrame({'id':[1,1,1,2,2,2],'y':[1,2,3,1,2,3]})
and df.groupby('id').expanding().mean().values
returns the correct:
array([[1. ],
[1.5],
[2. ],
[1. ],
[1.5],
[2. ]])
However, in my specific case I have to deal with some missing values as well, so that:
df2 = pd.DataFrame({'id':[1,1,1,2,2,2],'y':[1,pd.NA,3,1,2,3]})
My expected result applying the same logic would be to ignore the NaN in the computation of the mean, so that from df2.groupby('id').expanding().mean().values
I would expect
array([[1. ],
[1.],
[2. ],
[1. ],
[1.5],
[2. ]])
Instead, Pandas returns an error due to applying some type assertion to float in the backend. None of my naive attempts (e.g., .expanding().apply(lambda x: np.nansum(x)
) are solving this. Any (possibly equally compact) solution?
You can convert column 'y'
with pd.to_numeric
, which will coerce pd.NaN
into nan
. The latter can be interpreted correctly by the following operations:
df2["y"] = pd.to_numeric(df2["y"])
df2 = df2.groupby("id").expanding().mean().values
[[1. ]
[1. ]
[2. ]
[1. ]
[1.5]
[2. ]]