Tags: r, simulation, missing-data, outliers

Replace outliers with NA


I have found this function and I would like to adapt it to replace outliers with NA instead of removing the observation.

I tried appending <- NA to the line data <- data[!outliers(data[[col]]),] but I cannot make it work. Could you help me adapt it, please?

Here you can find the code with some simulated data. Please let me know if you need something else.

Thank you so much in advance.

cov.matone <- matrix(c(1, .0,
                       .0, 1), nrow = 2)

data <- data.frame(MASS::mvrnorm(n = 1e4, 
                                  mu = c(4, 4), 
                                  Sigma = cov.matone))

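# Flag values more than 1.5 * IQR outside the quartiles.
# Returns a logical vector the same length as x.
outliers <- function(x) {
  
  Q1 <- quantile(x, probs = .25, na.rm = TRUE)
  Q3 <- quantile(x, probs = .75, na.rm = TRUE)
  iqr <- Q3 - Q1
  
  upper_limit <- Q3 + (iqr * 1.5)
  lower_limit <- Q1 - (iqr * 1.5)
  
  x > upper_limit | x < lower_limit
}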

remove_outliers <- function(data, cols = names(data)) {
  for (col in cols) {
    data <- data[!outliers(data[[col]]),]
  }
  data
}

data_nooutliers <- remove_outliers(data, c('X1', 'X2' ))

Solution

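  • Instead of subsetting the data frame inside the loop and assigning the result back to data, use the replacement function is.na<- to set to NA the elements flagged by outliers. All rows are kept; only the flagged values are replaced.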

    remove_outliers <- function(data, cols = names(data)) {
      for (col in cols) {
        is.na(data[[col]]) <- outliers(data[[col]])
      }
      data
    }
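
As a quick check of the function above, here is a minimal sketch on a small made-up data frame. The toy data and its planted outliers are assumptions for illustration, not part of the original question; outliers is repeated so the snippet runs on its own.

```r
# Outlier rule from the question: 1.5 * IQR beyond the quartiles
outliers <- function(x) {
  Q1 <- quantile(x, probs = 0.25, na.rm = TRUE)
  Q3 <- quantile(x, probs = 0.75, na.rm = TRUE)
  iqr <- Q3 - Q1
  x > Q3 + 1.5 * iqr | x < Q1 - 1.5 * iqr
}

# NA-replacing version from the answer
remove_outliers <- function(data, cols = names(data)) {
  for (col in cols) {
    is.na(data[[col]]) <- outliers(data[[col]])
  }
  data
}

# Hypothetical toy data: one planted outlier per column
toy <- data.frame(X1 = c(1:9, 100), X2 = c(-50, 2:10))
res <- remove_outliers(toy, c("X1", "X2"))

nrow(res) == nrow(toy)  # TRUE: no rows are dropped
colSums(is.na(res))     # X1: 1, X2: 1 (the planted values became NA)
```

Because each column is processed independently, a row can end up with NA in one column and a valid value in the other, which is the difference from the row-dropping version in the question.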
    

    Note

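    On this data, the following one-liner flags the same values as function outliers. Note that boxplot.stats computes the IQR from Tukey's hinges (fivenum) rather than from quantile's default quartiles, and that %in% returns FALSE rather than NA for missing values, so the two functions are not guaranteed to agree on every input.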

    outliers2 <- function(x) x %in% boxplot.stats(x)$out
    
    s1 <- lapply(names(data), \(x) outliers(data[[x]]))
    s2 <- lapply(names(data), \(x) outliers2(data[[x]]))
    identical(s1, s2)
    #[1] TRUE
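
One caveat worth checking before swapping the two functions: they use slightly different quartile definitions, so they can disagree on small samples or borderline values. The sketch below uses a made-up ten-element vector (an assumption for illustration) on which the quantile-based limit is 14.5 and the hinge-based limit is 15.5, so the value 15 is flagged by one function and not by the other.

```r
# Quantile-based rule from the question
outliers <- function(x) {
  Q1 <- quantile(x, probs = 0.25, na.rm = TRUE)
  Q3 <- quantile(x, probs = 0.75, na.rm = TRUE)
  iqr <- Q3 - Q1
  x > Q3 + 1.5 * iqr | x < Q1 - 1.5 * iqr
}

# Hinge-based one-liner from the answer
outliers2 <- function(x) x %in% boxplot.stats(x)$out

# Hypothetical borderline sample
x <- c(1:9, 15)

quantile(x, c(0.25, 0.75))  # 3.25 and 7.75, so upper limit = 14.5
fivenum(x)[c(2, 4)]         # hinges 3 and 8, so upper limit = 15.5

which(outliers(x))          # 10
which(outliers2(x))         # integer(0)
```

On large, continuous samples like the simulated data in the question the two limits are nearly identical, which is why identical() returns TRUE there.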