How to use a generic method to remove outliers only if they exist in R


I am using a method to remove univariate outliers, but it only works if the vector actually contains outliers.

How can I generalize it so that it also works with vectors that have no outliers? I tried ifelse without success.

library(tidyverse)

df <- tibble(x = c(1,2,3,4,5,6,7,80))

outliers <- boxplot(df$x, plot=FALSE)$out
print(outliers)
#> [1] 80

# This removes the outliers
df2 <- df[-which(df$x %in% outliers),]

# a new tibble without outliers
df3 <- tibble(x = c(1,2,3,4,5,6,7,8))

outliers3 <- boxplot(df3$x, plot=FALSE)$out
print(outliers3) # no outliers
#> numeric(0)

# if I try to use the same expression when there are no outliers to remove
df4 <- df3[-which(df3$x %in% outliers3),]

# boxplot gives an error because df4 has 0 observations
# when I was expecting 8 observations
boxplot(df4$x)
#> Warning in min(x): no non-missing arguments to min; returning Inf
#> Warning in max(x): no non-missing arguments to max; returning -Inf
#> Error in plot.window(xlim = xlim, ylim = ylim, log = log, yaxs = pars$yaxs): need finite 'ylim' values

Solution

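  • Use logical negation (!) instead of -which(). Negation works even when there are no outliers: which() returns integer(0) when nothing matches, and indexing with -integer(0) selects zero rows rather than all of them.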

    df3[!(df3$x %in% outliers3),]
    

    Output:

    # A tibble: 8 x 1
          x
      <dbl>
    1     1
    2     2
    3     3
    4     4
    5     5
    6     6
    7     7
    8     8
    

    Or, if there are outliers, it removes them:

    df[!df$x %in% outliers,]
    # A tibble: 7 x 1
          x
      <dbl>
    1     1
    2     2
    3     3
    4     4
    5     5
    6     6
    7     7
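
The negation approach can be wrapped in a small reusable function. The following is a minimal sketch, not part of the original answer: the name remove_outliers and its arguments are hypothetical, it uses a base data.frame instead of a tibble so that it runs without the tidyverse, and it calls boxplot.stats(), which is what boxplot() uses internally to compute $out.

```r
# Hypothetical helper (an illustration, not the answer author's code):
# drop boxplot outliers from one column, or return the data unchanged
# when there are none.
remove_outliers <- function(data, column) {
  values <- data[[column]]
  # Same outlier rule as boxplot(values, plot = FALSE)$out
  out <- boxplot.stats(values)$out
  # Logical mask of rows to keep; it is all TRUE when `out` is empty
  keep <- !(values %in% out)
  data[keep, , drop = FALSE]
}

df  <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 80))
df3 <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8))

nrow(remove_outliers(df, "x"))   # 7 rows: the value 80 is dropped
nrow(remove_outliers(df3, "x"))  # 8 rows: nothing to drop

# Why -which() fails: with no matches, which() gives integer(0),
# and negating it is still integer(0), which selects zero rows.
length(-which(df3$x %in% numeric(0)))  # 0
```

Because the mask is a logical vector with one entry per row, it never collapses to an empty index, so the same call is safe whether or not outliers are present.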