Eliminate outlier datapoints in R

I have a dataframe that looks a bit like this reproducible code and I want to remove the outliers (in our case datapoints below or above 2.5 standard deviations above the mean) for each column without removing the entire subject/row.

Subj <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
Var1 <- c("1", "5", "100", "0.1", "3", "5", "2", "3", "2.5", "4")
Var2 <- sample(1:10, 10, replace=TRUE)
Var3 <- runif(10, min=0, max=700)
Var4 <- c("0.5", "0.1", "23", "0.2", "0.4", "0.6", "0.12", "0.3", "0.25", "-75")

df <- as.data.frame(cbind(Subj, Var1, Var2, Var3, Var4))

df$Var1_scale <- scale(as.numeric(df$Var1), scale = TRUE)
df$Var2_scale <- scale(as.numeric(df$Var2))
df$Var3_scale <- scale(as.numeric(df$Var3))
df$Var4_scale <- scale(as.numeric(df$Var4))

I would want to eliminate two datapoints - 100 for Var1 and -75 for Var4 based on the scaled variables. What is the best way of doing this? I always see it as eliminating the rows but that's not the aim here.

The output would look something like this (i.e. blank space instead of outlier)

   Subj Var1 Var2             Var3 Var4 Var1_scale Var2_scale Var3_scale Var4_scale
1     A    1    9 82.5652090134099  0.5 -0.3757658  0.8660254 -1.2116275  0.2128018
2     B    5    2 606.970524415374  0.1 -0.2457431 -1.1547005  1.2318109  0.1971919
3     C         9 422.833283618093   23  2.8422981  0.8660254  0.3738333  1.0908581
4     D  0.1   10 100.154890632257  0.2 -0.4050210  1.1547005 -1.1296693  0.2010944
5     E    3    4 144.251625519246  0.4 -0.3107545 -0.5773503 -0.9242029  0.2088993
6     F    5    2 310.489796195179  0.6 -0.2457431 -1.1547005 -0.1496251  0.2167043
7     G    2    8 624.485966027714 0.12 -0.3432602  0.5773503  1.3134231  0.1979724
8     H    3    3 617.240970185958  0.3 -0.3107545 -0.8660254  1.2796654  0.2049969
9     I  2.5   10 293.290452379733 0.25 -0.3270073  1.1547005 -0.2297645  0.2030456
10    J    4    3 223.737383470871      -0.2782488 -0.8660254 -0.5538433 -2.7335648

Solution

Instead of replacing them with blank replace them with NA so that the classes are maintained.

cols <- paste0('Var', 1:4)
mat <- sapply(df[cols], function(x) {
  mn <- mean(x, na.rm = TRUE)
  sd <- sd(x, na.rm = TRUE)
  (x > mn + sd * 2.5) | (x < mn - sd * 2.5)
})
df[cols][mat] <- NA

df
#   Subj Var1 Var2      Var3  Var4
#1     A  1.0    2 383.35261  0.50
#2     B  5.0    8 498.22071  0.10
#3     C   NA    6 272.23357 23.00
#4     D  0.1    1  70.61119  0.20
#5     E  3.0    2 649.11146  0.40
#6     F  5.0    7 198.26275  0.60
#7     G  2.0    7 413.40121  0.12
#8     H  3.0    4  77.25242  0.30
#9     I  2.5    9 588.35492  0.25
#10    J  4.0    8 222.57458    NA

data

You have created dataset in a way which changes numbers to characters which makes it difficult to perform any mathematical calculation on them. I use type.convert to change them to their original classes.

df <- type.convert(df)