Search code examples
rdplyroutliersquantile

Remove COMPLETE row if Outlier (e.g. 99% Quantile) is Detected from EVERY Column in a Dataframe


Let's say this is my data frame.

MyData <- data.frame(
+     X = sample(10:100, 21),
+     Y = sample(10:100, 21),
+     Z = sample(10:100, 21)
+ )

I understand how to print the quantiles of each column, either with sapply or apply:

> apply( MyData , 2, quantile , .99 , na.rm = TRUE )
   X    Y    Z 
98.0 97.6 92.8 
> sapply( MyData , quantile , .99 , na.rm = TRUE )
X.99% Y.99% Z.99% 
 98.0  97.6  92.8

However, deleting the whole ROW if a value above this threshold is detected – and this for EACH column – is not working for me. Any solution – with or without dplyris appreciated.


Solution

  • We can use filter_all from dplyr to filter rows for a condition for every columns. all_vars means all the columns needs to meet the condition.

    set.seed(123)
    MyData <- data.frame(
      X = sample(10:100, 21),
      Y = sample(10:100, 21),
      Z = sample(10:100, 21)
    )
    
    head(MyData)
    #   X  Y  Z
    # 1 36 73 47
    # 2 80 67 43
    # 3 46 98 23
    # 4 87 99 22
    # 5 91 71 30
    # 6 13 56 50
    
    library(dplyr)
    
    MyData2 <- MyData %>% filter_all(all_vars(. <= quantile(., 0.99, na.rm = TRUE)))
    
    head(MyData2)
    #    X  Y  Z
    # 1 36 73 47
    # 2 80 67 43
    # 3 46 98 23
    # 4 91 71 30
    # 5 13 56 50
    # 6 54 60 32