I have a dataframe that looks a bit like this reproducible code and I want to remove the outliers (in our case datapoints below or above 2.5 standard deviations above the mean) for each column without removing the entire subject/row.
Subj <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
Var1 <- c("1", "5", "100", "0.1", "3", "5", "2", "3", "2.5", "4")
Var2 <- sample(1:10, 10, replace=TRUE)
Var3 <- runif(10, min=0, max=700)
Var4 <- c("0.5", "0.1", "23", "0.2", "0.4", "0.6", "0.12", "0.3", "0.25", "-75")
df <- as.data.frame(cbind(Subj, Var1, Var2, Var3, Var4))
df$Var1_scale <- scale(as.numeric(df$Var1), scale = TRUE)
df$Var2_scale <- scale(as.numeric(df$Var2))
df$Var3_scale <- scale(as.numeric(df$Var3))
df$Var4_scale <- scale(as.numeric(df$Var4))
I would want to eliminate two datapoints - 100 for Var1 and -75 for Var4 based on the scaled variables. What is the best way of doing this? I always see it as eliminating the rows but that's not the aim here.
The output would look something like this (i.e. blank space instead of outlier)
Subj Var1 Var2 Var3 Var4 Var1_scale Var2_scale Var3_scale Var4_scale
1 A 1 9 82.5652090134099 0.5 -0.3757658 0.8660254 -1.2116275 0.2128018
2 B 5 2 606.970524415374 0.1 -0.2457431 -1.1547005 1.2318109 0.1971919
3 C 9 422.833283618093 23 2.8422981 0.8660254 0.3738333 1.0908581
4 D 0.1 10 100.154890632257 0.2 -0.4050210 1.1547005 -1.1296693 0.2010944
5 E 3 4 144.251625519246 0.4 -0.3107545 -0.5773503 -0.9242029 0.2088993
6 F 5 2 310.489796195179 0.6 -0.2457431 -1.1547005 -0.1496251 0.2167043
7 G 2 8 624.485966027714 0.12 -0.3432602 0.5773503 1.3134231 0.1979724
8 H 3 3 617.240970185958 0.3 -0.3107545 -0.8660254 1.2796654 0.2049969
9 I 2.5 10 293.290452379733 0.25 -0.3270073 1.1547005 -0.2297645 0.2030456
10 J 4 3 223.737383470871 -0.2782488 -0.8660254 -0.5538433 -2.7335648
Instead of replacing them with blank replace them with NA
so that the classes are maintained.
cols <- paste0('Var', 1:4)
mat <- sapply(df[cols], function(x) {
mn <- mean(x, na.rm = TRUE)
sd <- sd(x, na.rm = TRUE)
(x > mn + sd * 2.5) | (x < mn - sd * 2.5)
})
df[cols][mat] <- NA
df
# Subj Var1 Var2 Var3 Var4
#1 A 1.0 2 383.35261 0.50
#2 B 5.0 8 498.22071 0.10
#3 C NA 6 272.23357 23.00
#4 D 0.1 1 70.61119 0.20
#5 E 3.0 2 649.11146 0.40
#6 F 5.0 7 198.26275 0.60
#7 G 2.0 7 413.40121 0.12
#8 H 3.0 4 77.25242 0.30
#9 I 2.5 9 588.35492 0.25
#10 J 4.0 8 222.57458 NA
data
You have created dataset in a way which changes numbers to characters which makes it difficult to perform any mathematical calculation on them. I use type.convert
to change them to their original classes.
df <- type.convert(df)