This question focus on pandas own functions. There are still solutions (pandas DataFrame: replace nan values with average of columns) but with own written functions.
In SPSS there is function MEAN.n
which gives you the mean value of list of numbers only when n
elements of that list are valid (not pandas.NA
). With that function you are able to imputat missing values only if a minimum number of items are valid.
Are there pandas function to do this with?
Example
Values [1, 2, 3, 4, NA]
.
Mean of the valid values is 2.5
.
The resulting list should be [1, 2, 3, 4, 2.5]
.
Assume the rule that in a 5 item list 3 should have valid values for imputation. Otherwise the result is NA.
Values [1, 2, NA, NA, NA]
.
Mean of the valid values is 1.5
but it does not matter.
The resulting list should not be changed [1, 2, NA, NA, NA]
because imputation is not allowed.
Assuming you want to work with pandas
, you can define a custom wrapper (using only pandas functions) to fillna
with the mean
only if a minimum number of items are not NA:
from pandas import NA
s1 = pd.Series([1, 2, 3, 4, NA])
s2 = pd.Series([1, 2, NA, NA, NA])
def fillna_mean(s, N=4):
return s if s.notna().sum() < N else s.fillna(s.mean())
fillna_mean(s1)
# 0 1.0
# 1 2.0
# 2 3.0
# 3 4.0
# 4 2.5
# dtype: float64
fillna_mean(s2)
# 0 1
# 1 2
# 2 <NA>
# 3 <NA>
# 4 <NA>
# dtype: object
fillna_mean(s2, N=2)
# 0 1.0
# 1 2.0
# 2 1.5
# 3 1.5
# 4 1.5
# dtype: float64