Tags: r, dataset, lapply, outliers

How to remove outliers from a list of datasets


I am trying to remove outliers from a list of two datasets:

# create the list of datasets
repr = list(mtcars, airquality)

# detect outliers with boxplot
g_stats = lapply(repr, function(x) boxplot(x, main = "Boxplot")$out)

This is the code that I have applied with lapply:

new = lapply(repr, function(x) 
  x[ !(x %in% g_stats), ])

Unfortunately, the new list of datasets is identical to the original (the number of rows, at least, should differ). I cannot get lapply to filter each dataset against the corresponding element of the outlier list.

I have also tried to build a proper table of outliers:

# get the IDs of the outlying observations (Boxplot() comes from the car package)
id_out1 = lapply(repr, function(x) as.data.frame(Boxplot(x, id = TRUE)))

# get the actual outlier values
out1 = lapply(repr, 
             function(x) as.data.frame(boxplot(x, main = "Boxplot", plot = TRUE)$out))

outliers1 =  NULL
seq = c(1, 2)

names = c('ID', 'value')
for (i in seq_along(seq)) {
  outliers1[[seq[i]]] = if(nrow(id_out1[[i]]) == nrow(out1[[i]]))
  {cbind(id_out1[[i]], out1[[i]])} else {next}
  colnames(outliers1[[seq[i]]]) = names
  }

But I find it hard to exclude rows from each dataset in the list based on the IDs and values stored in the outliers1 list.

Can anyone suggest something?


Solution

  • We can use Map: it's similar to lapply, but instead of accepting just one list, it accepts an arbitrary number of lists.

    repr = list(mtcars, airquality)
    g_stats = lapply(repr, function(x) boxplot(x, main = "Boxplot", plot = FALSE)$out)
    sapply(repr, nrow)
    # [1]  32 153
    # apply(x, 1, ...) returns one column per row of x, so colSums() gives one flag per row
    repr2 <- Map(function(x, out) x[colSums(apply(x, 1, `%in%`, out)) == 0, ], repr, g_stats)
    sapply(repr2, nrow)
    # both counts are now lower than the original 32 and 153
    

    The data is indeed different (for mtcars this already shows in the first rows):

    lapply(repr, head)
    # [[1]]
    #                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
    # Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    # Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    # Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    # Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    # Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    # Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
    # [[2]]
    #   Ozone Solar.R Wind Temp Month Day
    # 1    41     190  7.4   67     5   1
    # 2    36     118  8.0   72     5   2
    # 3    12     149 12.6   74     5   3
    # 4    18     313 11.5   62     5   4
    # 5    NA      NA 14.3   56     5   5
    # 6    28      NA 14.9   66     5   6
    lapply(repr2, head)
    # [[1]]
    #                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    # Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
    # Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
    # Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
    # Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
    # Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
    # Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
    # [[2]]
    #   Ozone Solar.R Wind Temp Month Day
    # 1    41     190  7.4   67     5   1
    # 2    36     118  8.0   72     5   2
    # 3    12     149 12.6   74     5   3
    # 4    18     313 11.5   62     5   4
    # 5    NA      NA 14.3   56     5   5
    # 6    28      NA 14.9   66     5   6
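
    As a side note on why the original `x[!(x %in% g_stats), ]` left the data untouched, here is a small sketch. It assumes the same `repr` and `g_stats` as above; `hits` is just an illustrative name. `%in%` sees the data frame as a list of columns and compares each whole column against each whole outlier vector, so nothing matches and the short logical index is recycled over all rows.

    ```r
    repr <- list(mtcars, airquality)
    g_stats <- lapply(repr, function(x) boxplot(x, plot = FALSE)$out)

    # %in% treats the data frame as a list of columns: one result per column
    hits <- mtcars %in% g_stats
    length(hits)   # 11, the number of columns in mtcars
    any(hits)      # FALSE: no whole column matches a whole outlier vector

    # The all-TRUE index of length 11 is recycled over the rows, so every row is kept
    nrow(mtcars[!hits, ])   # 32
    ```

    This is why the row-wise comparison (or a per-column check) is needed instead of applying `%in%` to the data frame directly.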
    

    I think your case is underspecified, so I'll state an assumption that seems more reasonable:

    You want to remove a row if any value in it is an outlier relative to its own column. In your code, a number counts as an outlier if it matches an outlier from any column, but I think it should only be compared with the outliers of its own column.
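
    To illustrate the difference on mtcars, this sketch lists the outliers column by column using `boxplot.stats()` (the name `outs` is only for illustration). It shows one value that is an outlier in one column and perfectly ordinary in another.

    ```r
    # Outliers per column of mtcars, keeping only the columns that have any
    outs <- lapply(mtcars, function(z) boxplot.stats(z)$out)
    outs <- outs[lengths(outs) > 0]
    names(outs)
    # [1] "hp"   "wt"   "qsec" "carb"

    # 8 is an outlier for carb, yet it is an ordinary value for cyl
    8 %in% outs$carb       # TRUE
    sum(mtcars$cyl == 8)   # 14
    ```

    Pooling the outliers of all columns would therefore flag every 8-cylinder car, which is probably not what you want.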

    For this, a simple function:

    anyoutlier <- function(dat) rowSums(sapply(dat, function(z) z %in% boxplot.stats(z)$out)) > 0
    anyoutlier(mtcars)
    #  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
    # [19] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
    

    We then apply it to each frame in the list (the function itself checks every column):

    repr <- list(mtcars, airquality)
    repr2 <- lapply(repr, function(dat) dat[!anyoutlier(dat),])
    sapply(repr2, nrow)
    # [1]  28 148
    sapply(repr, nrow)
    # [1]  32 153
    lapply(repr2, head)
    # [[1]]
    #                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
    # Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    # Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    # Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    # Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    # Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    # Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
    # [[2]]
    #   Ozone Solar.R Wind Temp Month Day
    # 1    41     190  7.4   67     5   1
    # 2    36     118  8.0   72     5   2
    # 3    12     149 12.6   74     5   3
    # 4    18     313 11.5   62     5   4
    # 5    NA      NA 14.3   56     5   5
    # 6    28      NA 14.9   66     5   6
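
    If the frames may contain non-numeric columns, or you want a different whisker length, a slightly more defensive variant could look like the sketch below. The helper name `flag_outlier_rows` and the names `flagged` and `clean` are hypothetical, not part of the code above; `coef` is passed straight to `boxplot.stats()`.

    ```r
    # Hypothetical helper: TRUE for rows with an outlier in any numeric column
    flag_outlier_rows <- function(dat, coef = 1.5) {
      num_cols <- dat[vapply(dat, is.numeric, logical(1))]
      flags <- lapply(num_cols, function(z) z %in% boxplot.stats(z, coef = coef)$out)
      # OR the per-column flags; the initial value covers frames with no numeric columns
      Reduce(`|`, flags, rep(FALSE, nrow(dat)))
    }

    repr <- list(mtcars, airquality)
    flagged <- lapply(repr, flag_outlier_rows)
    clean <- Map(function(dat, f) dat[!f, , drop = FALSE], repr, flagged)

    # Row names of the rows dropped from mtcars
    rownames(mtcars)[flagged[[1]]]
    # [1] "Merc 230"            "Lincoln Continental" "Chrysler Imperial"   "Maserati Bora"
    ```

    Keeping the flags in their own list makes it easy to inspect what was removed before committing to the filtered data.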