Tags: r, replace, missing-data, median

Function to replace missing with median for whole dataframe


I'm trying to write a function that replaces the missing values in each column with the median, and that works for factor/character columns as well as numeric ones.

library(dplyr)
test = data.frame(a=1:6,b=c("a","b",NA,NA,NA,"c"),c=c(1,1,1,1,2,NA),d=c("a","a","c",NA,NA,"b"))

fun_rep_na = function(df){
  for(i in colnames(df)){
    j<-sym(i)
    df = df %>% mutate(!!j=if_else(is.na(!!j),median(!!j, na.rm=TRUE),!!j))
  }
}

I see that tidyr has a function called replace_na, but I'm not sure how to use this either. Anyway, a custom function is what I would like.

The code above gives me an error.


Solution

  • We can use mutate_if with is.numeric, since median is only meaningful for numeric columns

    test %>% 
       mutate_if(is.numeric, list(~ replace(., is.na(.), median(., na.rm = TRUE))))
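
    In dplyr 1.0.0 and later, the scoped verbs such as mutate_if are superseded by across(). A rough equivalent of the call above, assuming a dplyr version that provides across() and where(), might look like this (the data frame is the one from the question; `result` is just an illustrative name):

    ```r
    library(dplyr)

    # Example data from the question
    test <- data.frame(
      a = 1:6,
      b = c("a", "b", NA, NA, NA, "c"),
      c = c(1, 1, 1, 1, 2, NA),
      d = c("a", "a", "c", NA, NA, "b")
    )

    # Fill NAs in numeric columns with the column median;
    # non-numeric columns (b and d) are left untouched
    result <- test %>%
      mutate(across(where(is.numeric),
                    ~ replace(.x, is.na(.x), median(.x, na.rm = TRUE))))
    ```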
    

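    If we want the most frequent value instead (which also makes sense for character and factor columns), we need a Mode function, since base R has no built-in for the statistical mode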

    Mode <- function(x) {
      x <- x[!is.na(x)]
      ux <- unique(x)
      ux[which.max(tabulate(match(x, ux)))]
    }
    

    The Mode function was first updated here

    test %>% 
      mutate_all(list(~ replace(., is.na(.), Mode(.))))
    #  a b c d
    #1 1 a 1 a
    #2 2 b 1 a
    #3 3 a 1 c
    #4 4 a 1 a
    #5 5 a 2 a
    #6 6 c 1 b
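
Since the question asks for a single custom function, the two ideas can be combined. The following is only a sketch, assuming dplyr 1.0.0 or later: it fills numeric columns with the median and every other column with the most frequent value. The name fun_rep_na is taken from the question; the split between median and Mode is an assumption about what is wanted, not something the question states.

```r
library(dplyr)

# Most frequent non-missing value (same helper as in the answer above)
Mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Sketch: median for numeric columns, Mode for all other columns
fun_rep_na <- function(df) {
  df %>%
    mutate(
      across(where(is.numeric),
             ~ replace(.x, is.na(.x), median(.x, na.rm = TRUE))),
      across(!where(is.numeric),
             ~ replace(.x, is.na(.x), Mode(.x)))
    )
}

# Example data from the question
test <- data.frame(
  a = 1:6,
  b = c("a", "b", NA, NA, NA, "c"),
  c = c(1, 1, 1, 1, 2, NA),
  d = c("a", "a", "c", NA, NA, "b")
)

result <- fun_rep_na(test)
```

As for the error in the original attempt: `!!j = ...` inside mutate() is not valid R syntax, and rlang expects `!!j := ...` when the column name on the left-hand side is unquoted. The loop version also never returns df, so even a syntactically correct version would need an explicit final `df` line.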