Double loop to iterate in many columns to find outliers in R


I have a dataframe with an "id" for each individual and two traits ("x" and "y"), like the following:

id = c("A1","A2","A3","A4","A5","A6","A7","A8","A9","A10","A11","A12","A13","A14","A15","A16","A17","A18","A19","A20","A21","A22","A23","A24")
x = c(10,4,6,8,9,8,7,6,12,14,11,9,8,4,5,10,14,12,15,7,10,14,24,28)
y = c(1.5,1.2,5,2,0.8,4,1,1.1,1.2,1.4,1.3,1.6,0.9,0.8,1,1.1,1.3,1.5,1.2,1.1,1,1.2,1.1,1)
a = data.frame(id,x,y)

I want a loop that iterates over each trait and each individual so that I can create a new dataframe (or new columns of a) in which an individual gets a 1 if it is an outlier and a 0 if it is not. An outlier is any point that deviates 3 sd or more from the mean of the trait, in either direction.

In this example, the outlier for "x" is 28 and the outlier for "y" is 5. The required result would then look something like:

id = c("A1","A2","A3","A4","A5","A6","A7","A8","A9","A10","A11","A12","A13","A14","A15","A16","A17","A18","A19","A20","A21","A22","A23","A24")
x_out = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1)
y_out = c(0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
a_out = data.frame(id, x_out, y_out)

Any idea how to do it in a loop? The idea is that if I include new traits or individuals, I don't need to change the loop. Thanks!


Solution

  • There is no need for loops. You can test whether the absolute z-score, abs(scale(...)), is >= 3 for all columns at once:

    a_out <- a
    a_out[, -1] <- as.integer(abs(scale(a[, -1])) >= 3)
    
    #> a_out
        id x y
    1   A1 0 0
    2   A2 0 0
    3   A3 0 1
    4   A4 0 0
    5   A5 0 0
    6   A6 0 0
    7   A7 0 0
    8   A8 0 0
    9   A9 0 0
    10 A10 0 0
    11 A11 0 0
    12 A12 0 0
    13 A13 0 0
    14 A14 0 0
    15 A15 0 0
    16 A16 0 0
    17 A17 0 0
    18 A18 0 0
    19 A19 0 0
    20 A20 0 0
    21 A21 0 0
    22 A22 0 0
    23 A23 0 0
    24 A24 1 0
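
If an explicit loop is still preferred, as the question asks, the same z-score test can be sketched one column at a time. This is only a minimal sketch: the loop variable `trait`, the `_out` suffix (taken from the question's desired output), and the use of `na.rm = TRUE` are choices made here, not part of the answer above. Every column other than "id" is assumed to be a numeric trait, so new traits or individuals need no change to the loop.

```r
# Example data from the question
id <- paste0("A", 1:24)
x <- c(10,4,6,8,9,8,7,6,12,14,11,9,8,4,5,10,14,12,15,7,10,14,24,28)
y <- c(1.5,1.2,5,2,0.8,4,1,1.1,1.2,1.4,1.3,1.6,0.9,0.8,1,1.1,1.3,1.5,1.2,1.1,1,1.2,1.1,1)
a <- data.frame(id, x, y)

# Start from the id column, then add one <trait>_out column per trait
a_out <- data.frame(id = a$id)
for (trait in setdiff(names(a), "id")) {
  v <- a[[trait]]
  # z-score of each individual for this trait
  # (na.rm = TRUE is an assumption so that missing values do not
  # turn the whole column into NA)
  z <- (v - mean(v, na.rm = TRUE)) / sd(v, na.rm = TRUE)
  # 1 if the point lies 3 sd or more from the mean, 0 otherwise
  a_out[[paste0(trait, "_out")]] <- as.integer(abs(z) >= 3)
}
```

The result has the columns id, x_out and y_out, matching the layout asked for in the question.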
    

    Or using dplyr:

    library(dplyr)
    
    a_out <- a %>% 
      mutate(across(!id, \(x) as.integer(abs(scale(x)) >= 3)))
    # same output as above
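
Both versions above overwrite x and y in place. To get the x_out / y_out column names shown in the question instead, one possible variant uses the `.names` argument of `across()`. This is a sketch under the assumption that every numeric column is a trait; the lambda argument `v` is just an illustrative name.

```r
library(dplyr)

# Example data from the question
id <- paste0("A", 1:24)
x <- c(10,4,6,8,9,8,7,6,12,14,11,9,8,4,5,10,14,12,15,7,10,14,24,28)
y <- c(1.5,1.2,5,2,0.8,4,1,1.1,1.2,1.4,1.3,1.6,0.9,0.8,1,1.1,1.3,1.5,1.2,1.1,1,1.2,1.1,1)
a <- data.frame(id, x, y)

a_out <- a %>%
  # flag every numeric column and store the flag as <column>_out
  mutate(across(where(is.numeric),
                \(v) as.integer(abs(scale(v)) >= 3),
                .names = "{.col}_out")) %>%
  # keep only the id and the new flag columns
  select(id, ends_with("_out"))
```

Selecting with `where(is.numeric)` rather than `!id` means that extra non-numeric columns (a hypothetical "group" label, say) would be skipped instead of causing an error in `scale()`.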