Search code examples
pandasspss

Conditional imputation with average of non-missing columns with pandas toolbox


This question focus on pandas own functions. There are still solutions (pandas DataFrame: replace nan values with average of columns) but with own written functions.

In SPSS there is function MEAN.n which gives you the mean value of list of numbers only when n elements of that list are valid (not pandas.NA). With that function you are able to imputat missing values only if a minimum number of items are valid.

Are there pandas function to do this with?

Example

Values [1, 2, 3, 4, NA]. Mean of the valid values is 2.5. The resulting list should be [1, 2, 3, 4, 2.5].

Assume the rule that in a 5 item list 3 should have valid values for imputation. Otherwise the result is NA.

Values [1, 2, NA, NA, NA]. Mean of the valid values is 1.5 but it does not matter. The resulting list should not be changed [1, 2, NA, NA, NA] because imputation is not allowed.


Solution

  • Assuming you want to work with pandas, you can define a custom wrapper (using only pandas functions) to fillna with the mean only if a minimum number of items are not NA:

    from pandas import NA
    s1 = pd.Series([1, 2, 3, 4, NA])
    s2 = pd.Series([1, 2, NA, NA, NA])
    
    def fillna_mean(s, N=4):
        return s if s.notna().sum() < N else s.fillna(s.mean())
    
    fillna_mean(s1)
    # 0    1.0
    # 1    2.0
    # 2    3.0
    # 3    4.0
    # 4    2.5
    # dtype: float64
    
    fillna_mean(s2)
    # 0       1
    # 1       2
    # 2    <NA>
    # 3    <NA>
    # 4    <NA>
    # dtype: object
    
    fillna_mean(s2, N=2)
    # 0    1.0
    # 1    2.0
    # 2    1.5
    # 3    1.5
    # 4    1.5
    # dtype: float64