I've found a few answers similar to what I'm looking for, but the code has not worked.
I need to replace all outliers (as defined for our purposes as more than 2 SD away from the mean), of each column with NA or NULL. I'm trying to have this iterate across all columns, calculating the outliers for each column and replacing those values.
Disclosure: I also know statistically there are strong views on the removal of outliers, and the use of SD as a measure to exclude them given that SD is calculated including the outliers in this case. These were my instructions from our statistician, so I am working off that for now.
This is a glimpse of my dataset:
data
Group sp.Q13813.SPTN1_HUMAN sp.O14773.TPP1_HUMAN sp.P11137.MTAP2_HUMAN
1 Premutation 10713983468 367492324 2134747097
2 Premutation 10789498495 343303410 2677825476
3 Premutation 11134883489 383589325 2132552280
4 Premutation 9723552595 269965000 2262740921
5 Premutation 11175156282 359864993 1419225650
6 Premutation 10959077349 258095035 3343267633
7 Premutation 10770809133 331554977 2763604046
8 Premutation 11098182537 344384433 2198718886
I've tried code including the scale() function, but when I performed the below functions I found it replaced values that were within 2 SD of the mean of the column, and it left my code in the scaled format. I'm not as familiar with this function and was not sure how to revert it back to the original data points, even if it was replacing the correct outlier-deemed values.
Method attempt #1
# take note of order for column names
data.names_P <- colnames(data)
# scale all numeric columns
data.numeric.Pre <- select_if(data, is.numeric) %>% # subset of numeric columns
mutate_all(scale) # perform scale separately for each column
data.numeric.Pre[data.numeric.Pre > 2] <- 99999 # set values larger than 2 to NA (none in this example)
# combine results with subset data frame of non-numeric columns
data.Pre <- data.frame(select_if(data, function(x) !is.numeric(x)),
data.numeric)
# restore columns to original order
data.Pre <- data.Pre[, data.names_P]
Method attempt #2
FindOutliers <- function(data) {
upper = (2*sd(data) + mean(data)
lower = (mean - 2*sd(data))
result <- which(data < upper | data > lower)
}
I know attempt number two does not replace the data with NA.
Any help would be immensely appreciated.
You can use a ifelse
function, here an example using dplyr
and applying the ifelse
function over all columns containing the term HUMAN
:
library(dplyr)
data %>% mutate_at(.vars = vars(contains("HUMAN")),
.funs= ~ifelse(abs(.)>mean(.)+2*sd(.), NA, .))