
How to avoid looping over rows and columns to increase speed in R


I am a new R user, and need to use the software for my first job. I tried looking for a similar issue to mine on the website, but haven't found one. Apologies if my question is redundant.

The problem I have is that I need to edit outliers in every column. A reproducible example is below:

    data_X <- matrix(data = rep(1, 100), nrow = 10, ncol = 10)

    for (i in 1:nrow(data_X)) {
      for (j in 1:ncol(data_X)) {
        # data_X[, j] is column j; its quantiles are recomputed for every single cell
        if (is.na(data_X[i, j])) {
          data_X[i, j] <- NA
        } else if (data_X[i, j] > (quantile(data_X[, j], 0.75, na.rm = TRUE) + 1.5 * (quantile(data_X[, j], 0.75, na.rm = TRUE) - quantile(data_X[, j], 0.25, na.rm = TRUE)))) {
          data_X[i, j] <- quantile(data_X[, j], 0.5, na.rm = TRUE)
        } else if (data_X[i, j] < (quantile(data_X[, j], 0.25, na.rm = TRUE) - 1.5 * (quantile(data_X[, j], 0.75, na.rm = TRUE) - quantile(data_X[, j], 0.25, na.rm = TRUE)))) {
          data_X[i, j] <- quantile(data_X[, j], 0.5, na.rm = TRUE)
        } else {
          data_X[i, j] <- data_X[i, j]
        }
      }
    }

In reality, the matrix is of a much larger dimension, and it takes about 4 minutes to loop through the code. This is way too long for my purposes, and I wonder if there is a more elegant way.

I have done some research, and apparently apply() would not improve speed...

Edit:

Rules:

Data points above the 75% quantile + 1.5 * the interquartile range,

and

data points below the 25% quantile - 1.5 * the interquartile range,

are converted to the median.
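To make the rules concrete, here is a small worked example on a single column. The vector `x` is made up for illustration and is not part of the real data; it has one obvious outlier and one missing value.

```r
# Hypothetical sample column: one extreme value (100) and one NA
x <- c(1, 2, 3, 4, 5, 6, 7, 8, NA, 100)

# 25% and 75% quantiles (R's default quantile type), ignoring the NA
q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[[2]] - q[[1]]          # interquartile range: 7 - 3 = 4
lower <- q[[1]] - 1.5 * iqr     # 3 - 6  = -3
upper <- q[[2]] + 1.5 * iqr     # 7 + 6  = 13

# Flag values outside [lower, upper]; NA entries are never flagged
is_outlier <- !is.na(x) & (x < lower | x > upper)

# Replace flagged values with the median of the original column (5 here)
x[is_outlier] <- median(x, na.rm = TRUE)

x
# 1 2 3 4 5 6 7 8 NA 5
```

Only the value 100 falls outside the fences, so it is the only one replaced; the NA stays NA.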


Solution

  1. We create a rule function that makes use of the vectorized ifelse().

    rule_function <- function(x) {
      
      q25 <- quantile(x, 0.25, na.rm = TRUE)
      q75 <- quantile(x, 0.75, na.rm = TRUE)
      iqr <- q75 - q25
      lower <- q25 - 1.5 * iqr
      upper <- q75 + 1.5 * iqr
      
      result <- ifelse(x < lower | x > upper, median(x, na.rm = TRUE), x)
    
      return(result)  
    }
    

  2. Then we apply the function to each column of the matrix:

    apply(data_X, 2, rule_function)
    

The example data is constant, so it contains no outliers and doesn't really allow testing; I am not 100% sure whether this helps you. However, this took only a few seconds for a 10000 x 10000 matrix (whether that is fast enough depends on your actual use case).
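Because the matrix in the question is constant, a quick check needs data that actually contains outliers. Below is a minimal sketch, assuming a hypothetical `test_matrix` of random normal values with extreme values and one NA injected by hand; the sizes and the seed are arbitrary. `rule_function` is repeated so the snippet runs on its own.

```r
# Same rule as above, repeated so this snippet is self-contained
rule_function <- function(x) {
  q25 <- quantile(x, 0.25, na.rm = TRUE)
  q75 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q75 - q25
  lower <- q25 - 1.5 * iqr
  upper <- q75 + 1.5 * iqr
  ifelse(x < lower | x > upper, median(x, na.rm = TRUE), x)
}

set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical test data: 1000 x 50 standard normal values
test_matrix <- matrix(rnorm(1000 * 50), nrow = 1000, ncol = 50)
test_matrix[1, ] <- 1000    # inject an extreme high value into every column
test_matrix[2, ] <- -1000   # inject an extreme low value into every column
test_matrix[3, 1] <- NA     # one missing value, which should stay missing

# Quantiles are computed once per column instead of once per cell
cleaned <- apply(test_matrix, 2, rule_function)

dim(cleaned)                      # same shape as the input
max(abs(cleaned), na.rm = TRUE)   # the injected +/-1000 values are gone
is.na(cleaned[3, 1])              # the NA is preserved
```

Wrapping the `apply()` call in `system.time()` on a matrix of realistic size would give a rough idea of whether this is fast enough.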