Conditional imputation with average of non-missing columns with pandas toolbox

This question focus on pandas own functions. There are still solutions (pandas DataFrame: replace nan values with average of columns) but with own written functions.

In SPSS there is function MEAN.n which gives you the mean value of list of numbers only when n elements of that list are valid (not pandas.NA). With that function you are able to imputat missing values only if a minimum number of items are valid.

Are there pandas function to do this with?

Example

Values [1, 2, 3, 4, NA]. Mean of the valid values is 2.5. The resulting list should be [1, 2, 3, 4, 2.5].

Assume the rule that in a 5 item list 3 should have valid values for imputation. Otherwise the result is NA.

Values [1, 2, NA, NA, NA]. Mean of the valid values is 1.5 but it does not matter. The resulting list should not be changed [1, 2, NA, NA, NA] because imputation is not allowed.

Solution

Assuming you want to work with pandas, you can define a custom wrapper (using only pandas functions) to fillna with the mean only if a minimum number of items are not NA:

from pandas import NA
s1 = pd.Series([1, 2, 3, 4, NA])
s2 = pd.Series([1, 2, NA, NA, NA])

def fillna_mean(s, N=4):
    return s if s.notna().sum() < N else s.fillna(s.mean())

fillna_mean(s1)
# 0    1.0
# 1    2.0
# 2    3.0
# 3    4.0
# 4    2.5
# dtype: float64

fillna_mean(s2)
# 0       1
# 1       2
# 2    <NA>
# 3    <NA>
# 4    <NA>
# dtype: object

fillna_mean(s2, N=2)
# 0    1.0
# 1    2.0
# 2    1.5
# 3    1.5
# 4    1.5
# dtype: float64