I have the following dataset:
5 3 3 5 10 10 3 8 2 12 8 6 2 5 6 5 10 4 3 5 4 3 3 5 8 3 5 6 6 1 10 3 6 6 5 8 3 4 3 4 4 3 2.5 1 4 2 2 3 5 10 4 4 6 3 2 3 8 3 4 4 3 3 4 8 4 4 2 4 4 3 2 10 6 3 7 3 5 3 1 4 3 4 3 4 4 2 3 2 4 7 4 6 3.5 3.5 5 3 4 3 5 3 1.5 2.5 3 7 2 5 3 4 2 4 5 3 4 5 4.5 4 6 3 2 1 3 2 2 3 4 6 2 4 2 3 6 1.5 3 3 1 4 3 3 2 3 2 2 6 3 15 1 4 5 2 6 2 4 8 2 8 4 4 4 3 8 4 4 8.5 3 2 7 0.5 3 3 3 2 3 2 4 5 6 2 3.5 3 3 2 2 2.5 2 2 5 2 8 2 4 3 3 2 7 2 4 2 4 4 3 2.5 3 3 3 5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
I want to replace the NAs using either mean or median imputation.
Which method would be appropriate in such a case, and why?
Please help me learn.
Thanks.
In R, I am trying this with the median using:
# replacing with Median
df$val[is.na(df$val)] <- with(df,
  ave(val, FUN = function(x)
    median(x, na.rm = TRUE)))[is.na(df$val)]
I have a feeling that this is not the correct way to impute.
Can someone help clarify my doubts?
Thanks.
It depends on the distribution of the data. If there are many outliers, use the median for missing-value imputation, because the mean is pulled toward extreme values while the median is not.
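The best approach is to look at the observed values first.
The data is in df$val:
# df2 must be created before a column can be assigned to it
df2 <- data.frame(val = df$val[!is.na(df$val)])  # observed (non-NA) values only
summary(df2$val)
hist(df2$val)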
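As a rough numeric companion to the histogram, the mean and median of the observed values can be compared directly. A mean well above the median suggests a right skew or a few large outliers. This is only a sketch on a small made-up vector (x is hypothetical, not the dataset from the question):

```r
# Hypothetical toy vector, not the poster's data
x <- c(2, 3, 3, 4, 5, 15, NA, NA)

obs_mean   <- mean(x, na.rm = TRUE)    # pulled upward by the 15
obs_median <- median(x, na.rm = TRUE)  # does not depend on how large the 15 is

print(c(mean = obs_mean, median = obs_median))
```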
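Then replace the NAs with one of the following (run only one, since the first leaves no NAs for the second to fill).
Replacing by mean:
df$val <- ifelse(is.na(df$val), mean(df$val, na.rm = TRUE), df$val)
Replacing by median:
df$val <- ifelse(is.na(df$val), median(df$val, na.rm = TRUE), df$val)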
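If you want to compare both options side by side without overwriting df$val, one possible sketch is a small helper that fills NAs with a chosen statistic and writes the result to new columns. The helper name impute_na, the new column names, and the toy data frame are all assumptions made for illustration:

```r
# Hypothetical helper: fill the NAs in x with a summary statistic of the observed values
impute_na <- function(x, fun = median) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

# Hypothetical toy data frame, not the poster's data
df <- data.frame(val = c(2, 3, 3, 4, 5, 15, NA, NA))

# Keep the original column and store each imputed version separately
df$val_mean   <- impute_na(df$val, mean)
df$val_median <- impute_na(df$val, median)

print(df)
```

Keeping the original column intact makes it easy to check which rows were imputed and to switch methods later.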