
Remove outlier from a single cell in R


I am new to R and I am stuck on a problem with removing some outliers. I have a data frame that looks something like this:

Item1   Item2   Item3
 4.05    3.9   3.6
 12      3.7   4
 4.01    3.8   4

My desired result is something like the table below, where the outliers in each column are replaced with NA:

Item1  Item2  Item3 
4.05    3.9    3.6
NA      3.7    4
4.01    3.8    4 

So far I have written code that detects the outliers, but I am stuck on removing them: the entire column changes instead of the single value.

find_outlier <- function(log_reaction_time) {
  media <- mean(log_reaction_time)
  devst <- sd(log_reaction_time)
  result <- which(log_reaction_time < media - 2 * devst | log_reaction_time > media + 2 * devst)
  log_reaction_time2 <- ifelse(log_reaction_time %in% result, NA, log_reaction_time)
}
apply(log_reaction_time, 2, find_outlier)

I guess the problem comes from the fact that I apply the function over the columns (2), as I want to find the outliers of the column, but then I want to remove only the relevant values...


Solution

  • We will use the same dataset to show this:

    #Data
    df1 <- structure(list(Item1 = c(4.05, 12, 4.01), Item2 = c(3.9, 3.7, 
    3.8), Item3 = c(3.6, 4, 4)), class = "data.frame", row.names = c(NA, 
    -3L))
    
    df1
      Item1 Item2 Item3
    1  4.05   3.9   3.6
    2 12.00   3.7   4.0
    3  4.01   3.8   4.0
    

    Now the function:

    #Function
    find_outlier <- function(log_reaction_time) {
      media <- mean(log_reaction_time)
      devst <- sd(log_reaction_time)
      result <-which(log_reaction_time < media - 2 * devst | log_reaction_time > media + 2 * devst)
      log_reaction_time[result] <- NA
      return(log_reaction_time)
    }
    
    apply(df1, 2, find_outlier)
    
         Item1 Item2 Item3
    [1,]  4.05   3.9   3.6
    [2,] 12.00   3.7   4.0
    [3,]  4.01   3.8   4.0
    

    Note that the second value of Item1 is not set to NA, because mean(df1$Item1)=6.69 and sd(df1$Item1)=4.60. The limits checked by the condition are therefore mean(df1$Item1)-2*sd(df1$Item1)=-2.52 and mean(df1$Item1)+2*sd(df1$Item1)=15.89, and 12 lies inside them, so it is not flagged. With only three values, no observation can be more than about 1.15 sample standard deviations from the mean, so a 2-sd rule can never flag anything in this data. You will have to define other criteria to assign it NA.
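
One possible "other criterion", shown only as a sketch, is a rule based on the median and the median absolute deviation (MAD), which a single extreme value distorts far less than it distorts the mean and sd. This is a different technique from the 2-sd rule in the question, not a fix for it. The helper name `mask_outliers_mad` and the cutoff `k = 3` are assumptions made for illustration. The sketch also leaves a column unchanged when its MAD is zero, since every value that differs from the median would otherwise be flagged.

```r
# Same data as in the answer above
df1 <- data.frame(Item1 = c(4.05, 12, 4.01),
                  Item2 = c(3.9, 3.7, 3.8),
                  Item3 = c(3.6, 4, 4))

# Hypothetical helper: replace values lying more than k scaled MADs
# from the column median with NA. k = 3 is an assumed cutoff.
mask_outliers_mad <- function(x, k = 3) {
  center <- median(x, na.rm = TRUE)
  spread <- mad(x, na.rm = TRUE)  # stats::mad, scaled by 1.4826 by default
  # With a zero (or undefined) MAD the rule is not usable, so return x as is
  if (is.na(spread) || spread == 0) return(x)
  x[which(abs(x - center) > k * spread)] <- NA
  x
}

# Assigning into df1_clean[] keeps the result a data frame;
# apply() would return a matrix instead
df1_clean <- df1
df1_clean[] <- lapply(df1, mask_outliers_mad)
df1_clean
```

On this data only the 12 in Item1 is replaced by NA. Whether a cutoff of 3 MADs suits real reaction-time data is a judgment call, and with so few rows per column any outlier rule should be treated with caution.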