Tags: r, math, statistics, imputation

NA replacement using the mean or the median: which is better for my data?


I have the following dataset:

5   3   3   5   10  10  3   8   2   12  8   6   2   5   6   5   10  4   3   5   4   3   3   5   8   3   5   6   6   1   10  3   6   6   5   8   3   4   3   4   4   3   2.5 1   4   2   2   3   5   10  4   4   6   3   2   3   8   3   4   4   3   3   4   8   4   4   2   4   4   3   2   10  6   3   7   3   5   3   1   4   3   4   3   4   4   2   3   2   4   7   4   6   3.5 3.5 5   3   4   3   5   3   1.5 2.5 3   7   2   5   3   4   2   4   5   3   4   5   4.5 4   6   3   2   1   3   2   2   3   4   6   2   4   2   3   6   1.5 3   3   1   4   3   3   2   3   2   2   6   3   15  1   4   5   2   6   2   4   8   2   8   4   4   4   3   8   4   4   8.5 3   2   7   0.5 3   3   3   2   3   2   4   5   6   2   3.5 3   3   2   2   2.5 2   2   5   2   8   2   4   3   3   2   7   2   4   2   4   4   3   2.5 3   3   3   5 NA NA NA NA NA  NA NA NA NA NA NA NA NA NA NA

I want to replace the NAs using either mean or median imputation.

Which method would be appropriate in this case, and why?

Please help me learn.

Thanks.

In R, I am trying median imputation with:

# Replace NAs with the median of the observed values
df$val[is.na(df$val)] <- with(df,
                              ave(val, FUN = function(x)
                                    median(x, na.rm = TRUE)))[is.na(df$val)]

I have a feeling that this is not the correct way to impute.

Can someone help clarify my doubts:

  1. Does median imputation have any side effects, given that some values occur very frequently and others only rarely?
  2. Because of the outliers, imputing with the mean does not seem like a good idea. What alternative methods are there?

Thanks.


Solution

  • It depends on the distribution of the data. If there are many outliers, use the median for missing-value imputation, because the mean is pulled toward extreme values and the median is not.

    It is best to inspect the observed values first. With the data in df$val:

    val_obs <- na.omit(df$val)   # observed (non-missing) values only
    
    summary(val_obs)
    
    hist(val_obs)
    

    Then choose one of the following.

    Replacing by mean:

    df$val <- ifelse(is.na(df$val), mean(df$val, na.rm = TRUE), df$val)
    

    Replacing by median:

    df$val <- ifelse(is.na(df$val), median(df$val, na.rm = TRUE), df$val)
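
On the question of side effects: whichever statistic is used, filling every NA with one value piles extra observations onto a single point, so the imputed column looks less spread out than the observed data. The following is a minimal sketch of that effect, not a definitive recipe. It assumes a short illustrative vector (the first ten values from the question plus two NAs) and a hypothetical helper, impute_with, which is not part of any package.

```r
# Illustrative subset: the first ten values from the question plus two NAs.
val <- c(5, 3, 3, 5, 10, 10, 3, 8, 2, 12, NA, NA)

# Hypothetical helper (not from any package): fill NAs with a summary
# statistic computed on the observed values only.
impute_with <- function(x, stat = median) {
  x[is.na(x)] <- stat(x, na.rm = TRUE)
  x
}

val_mean   <- impute_with(val, mean)
val_median <- impute_with(val, median)

# Compare the spread before and after imputation.
spread <- c(
  observed       = sd(val, na.rm = TRUE),
  mean_imputed   = sd(val_mean),
  median_imputed = sd(val_median)
)
print(round(spread, 2))
```

In this sketch both imputed versions have a smaller standard deviation than the observed values. With 15 NAs out of roughly 200 observations the shrinkage in the real data would likely be modest, but it is worth keeping in mind before running tests or models on the imputed column.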