I need to mark columns containing outliers in R, preferably using a while loop so it's easy to apply to other situations. For each column, I would like to create new variables denoting whether a value is an outlier above the upper or below the lower IQR bound.
Assuming I'm using the txhousing dataset, I am trying to end up with the following columns:
city year ... city city.out.up city.out.down year year.out.up year.out.down ...
My solution so far looks like this (I have also tried using paste()):
while (i < 9) {
iqr <- IQR(df[,i], na.rm = TRUE)
fiver <- fivenum(df[,i])
lowerbound <- fiver[2] - (1.5*iqr)
mutate(df, VAR.out.down = case_when(df[,i] <= lowerbound ~ 1, df[,i] > lowerbound ~ 0))
upperbound <- fiver[4] + (1.5*iqr)
mutate(df, VAR.out.up = case_when(df[,i] >= upperbound ~ 1, df[,i] < upperbound ~ 0))
boxplot(df[,i], main = colnames(df[,i]))
i = i + 1
}
Is there a way to create dynamic variable names with a predetermined suffix using mutate?
Create a function that takes your vector of interest and returns a two-column data.table with column names out.down and out.up. You can adjust the function f() below for your purposes:
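f <- function(x) {
  # lower and upper quartiles of x, ignoring missing values
  q = quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  # flag values below the lower quartile and above the upper quartile
  data.table(out.down = x < q[1], out.up = x > q[2])
}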
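Note that f() flags anything outside the quartiles themselves, whereas the question defines outliers by the 1.5 * IQR fences. One possible adjustment is sketched below. The name f_iqr and the multiplier argument k are assumptions made for this illustration, and quantile() with its default type may give slightly different cutoffs than the fivenum() hinges used in the question.

```r
library(data.table)

# Hypothetical variant of f(): flags values outside the k * IQR fences.
# f_iqr and k are names chosen for this sketch only.
f_iqr <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  data.table(
    out.down = x < q[1] - k * iqr,  # below the lower fence
    out.up   = x > q[2] + k * iqr   # above the upper fence
  )
}

# Toy vector: only 100 lies above the upper fence, nothing below the lower one.
res <- f_iqr(c(1, 2, 3, 4, 5, 100))
print(res)
```

f_iqr can be passed to lapply() anywhere f would be used.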
Then apply the function to the numeric columns of your dataset:
library(data.table)
df = setDT(ggplot2::txhousing)
cbind(df, df[,lapply(.SD, f), .SDcols = is.numeric])
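If you would rather stay with mutate(), as the question asks, dplyr's across() can build the suffixed names through its .names argument. The following is a rough sketch, assuming dplyr 1.0 or later and ggplot2 for the dataset. The helper names is_low and is_high are made up for this example, and it uses the 1.5 * IQR fences from the question rather than the quartile cutoffs.

```r
library(dplyr)

# Hypothetical helpers: TRUE where a value falls outside the k * IQR fences.
is_low <- function(x, k = 1.5) {
  x < quantile(x, 0.25, na.rm = TRUE, names = FALSE) - k * IQR(x, na.rm = TRUE)
}
is_high <- function(x, k = 1.5) {
  x > quantile(x, 0.75, na.rm = TRUE, names = FALSE) + k * IQR(x, na.rm = TRUE)
}

out <- ggplot2::txhousing %>%
  mutate(across(
    where(is.numeric),
    list(out.down = is_low, out.up = is_high),
    .names = "{.col}.{.fn}"  # e.g. sales.out.down, sales.out.up
  ))

names(out)
```

The glue pattern in .names controls the suffix, so changing the names in the list of functions is enough to change the suffixes of the new columns.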