r, data-cleaning, outliers, iqr

Can I remove outliers from all columns of my R data frame?


I have a data frame with 431 variables and 140 observations, and I need to remove outliers. The dataset contains several NA values, and I do not want to drop every row that has an NA. I am trying to remove the outliers with the IQR method, and so far I have been able to obtain the quartiles and IQR as follows:

data <- df2[,4:434]
apply(data,2,quantile, probs=c(0.25,0.75), na.rm=TRUE) -> Quartiles
sapply(data,IQR, na.rm=TRUE) -> iqr

I've also calculated the lower and upper values for each of my columns:

Lower <- Quartiles[1,]-1.5*iqr
Upper <- Quartiles[2,]+1.5*iqr

However, when I tried to replace the outliers with NA, my data frame did not change:

data_no_outlier <- replace(data, data[1:431] < Lower  & data[1:431] > Upper, NA)

I also tried this script on the iris data, with the same unsuccessful result:

data(iris, package = "datasets")
completeData <- iris[-5]
apply(completeData,2,quantile, probs=c(0.25,0.75), na.rm=TRUE) -> Quartiles
sapply(completeData,IQR, na.rm=TRUE) -> iqr

Lower <- Quartiles[1,]-1.5*iqr
Upper <- Quartiles[2,]+1.5*iqr

data_no_outlier <- replace(completeData, completeData < Lower & completeData > Upper, NA)

Is there a way to filter out outliers from my data that does not require manually selecting all the columns by name?


Solution

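  • Here's one method: a helper that works on one column at a time, computing that column's quartiles and IQR and replacing any value outside the resulting fences with NA: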

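    # Replace values outside [Q1 - fac*IQR, Q3 + fac*IQR] with NA for one vector z.
    fun <- function(z, fac = 1.5, na.rm = TRUE) {
      Q <- quantile(z, c(0.25, 0.75), na.rm = na.rm)  # first and third quartiles
      R <- IQR(z, na.rm = na.rm)                       # interquartile range
      # "|" (or), not "&": a value is an outlier if it falls outside either fence
      z[z < Q[1] - fac * R | z > Q[2] + fac * R] <- NA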
      z
    }
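
    As a quick sanity check of the helper, here is a minimal sketch on a made-up vector `x` (hypothetical data, not from the question). It also shows what I believe is one reason the attempt in the question changed nothing: a condition of the form `x < lower & x > upper` can never be TRUE, because no value is below the lower fence and above the upper fence at once. The names `lower`, `upper`, `n_and`, `n_or` and `cleaned` are only for illustration.

    ```r
    # Same helper as above, repeated so this snippet runs on its own.
    fun <- function(z, fac = 1.5, na.rm = TRUE) {
      Q <- quantile(z, c(0.25, 0.75), na.rm = na.rm)
      R <- IQR(z, na.rm = na.rm)
      z[z < Q[1] - fac * R | z > Q[2] + fac * R] <- NA
      z
    }

    # Hypothetical vector: one existing NA and one obvious outlier.
    x <- c(1, 2, 3, 4, 5, NA, 100)

    Q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
    lower <- Q[1] - 1.5 * IQR(x, na.rm = TRUE)
    upper <- Q[2] + 1.5 * IQR(x, na.rm = TRUE)

    # "below lower AND above upper" is impossible, so nothing is flagged.
    n_and <- sum(x < lower & x > upper, na.rm = TRUE)
    n_and
    # [1] 0

    # "below lower OR above upper" flags the value 100.
    n_or <- sum(x < lower | x > upper, na.rm = TRUE)
    n_or
    # [1] 1

    # The helper keeps the existing NA and turns the outlier into NA.
    cleaned <- fun(x)
    cleaned
    # [1]  1  2  3  4  5 NA NA
    ```

    The vector keeps its length, so when this is applied column by column no rows are dropped.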
    

    Sample data:

    set.seed(42)
    quux <- data.frame(ltr = letters[1:10], num1 = c(99, runif(9)), num2 = c(runif(9), 99))
    quux
    #    ltr       num1       num2
    # 1    a 99.0000000  0.7050648
    # 2    b  0.9148060  0.4577418
    # 3    c  0.9370754  0.7191123
    # 4    d  0.2861395  0.9346722
    # 5    e  0.8304476  0.2554288
    # 6    f  0.6417455  0.4622928
    # 7    g  0.5190959  0.9400145
    # 8    h  0.7365883  0.9782264
    # 9    i  0.1346666  0.1174874
    # 10   j  0.6569923 99.0000000
    

    dplyr

    library(dplyr)
    quux %>%
      mutate(across(where(is.numeric), fun))
    #    ltr      num1      num2
    # 1    a        NA 0.7050648
    # 2    b 0.9148060 0.4577418
    # 3    c 0.9370754 0.7191123
    # 4    d 0.2861395 0.9346722
    # 5    e 0.8304476 0.2554288
    # 6    f 0.6417455 0.4622928
    # 7    g 0.5190959 0.9400145
    # 8    h 0.7365883 0.9782264
    # 9    i 0.1346666 0.1174874
    # 10   j 0.6569923        NA
    

    base R

    isnum <- sapply(quux, is.numeric)
    quux[isnum] <- lapply(quux[isnum], fun)
    quux
    #    ltr      num1      num2
    # 1    a        NA 0.7050648
    # 2    b 0.9148060 0.4577418
    # 3    c 0.9370754 0.7191123
    # 4    d 0.2861395 0.9346722
    # 5    e 0.8304476 0.2554288
    # 6    f 0.6417455 0.4622928
    # 7    g 0.5190959 0.9400145
    # 8    h 0.7365883 0.9782264
    # 9    i 0.1346666 0.1174874
    # 10   j 0.6569923        NA
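
    If you would rather keep the `Lower`/`Upper` vectors from the question, the sketch below is one possible repair of that approach on the iris example. Apart from the `&` vs `|` issue, as far as I can tell comparing a data frame with a plain vector recycles the vector down the rows instead of matching one bound per column, so here `sweep()` is used to line each bound up with its own column. The injected NAs and the names `m`, `is_out` and `via_fun` are assumptions for illustration only.

    ```r
    # Same helper as above, repeated so this snippet runs on its own.
    fun <- function(z, fac = 1.5, na.rm = TRUE) {
      Q <- quantile(z, c(0.25, 0.75), na.rm = na.rm)
      R <- IQR(z, na.rm = na.rm)
      z[z < Q[1] - fac * R | z > Q[2] + fac * R] <- NA
      z
    }

    completeData <- iris[-5]            # numeric columns only, as in the question
    completeData[c(3, 50), 1] <- NA     # hypothetical pre-existing missing values

    # Per-column fences, computed as in the question.
    Quartiles <- apply(completeData, 2, quantile, probs = c(0.25, 0.75), na.rm = TRUE)
    iqr <- sapply(completeData, IQR, na.rm = TRUE)
    Lower <- Quartiles[1, ] - 1.5 * iqr
    Upper <- Quartiles[2, ] + 1.5 * iqr

    # MARGIN = 2 compares every column against its own bound.
    m <- as.matrix(completeData)
    is_out <- sweep(m, 2, Lower, "<") | sweep(m, 2, Upper, ">")
    is_out[is.na(is_out)] <- FALSE      # existing NAs are not counted as outliers

    data_no_outlier <- completeData
    data_no_outlier[is_out] <- NA

    # Cross-check against the column-wise helper.
    via_fun <- completeData
    via_fun[] <- lapply(completeData, fun)
    all.equal(data_no_outlier, via_fun)

    colSums(is_out)                     # number of values flagged per column
    nrow(data_no_outlier)               # same number of rows as before
    ```

    For the data frame in the question, the column-wise version restricted to a position range would presumably be `df2[4:434] <- lapply(df2[4:434], fun)`, which leaves the first three columns and all rows untouched.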