Search code examples
rdplyrtidyverseplyr

Removing outliers within each group of a dataframe


i have the following dataset in R

data <- structure(list(BatcBatchNo = structure(c(9L, 9L, 9L, 9L, 
    9L, 9L, 9L, 9L, 9L, 9L), .Label = c("Batch18200616", "Batch18200702", 
    "Batch18200703", "Batch18200704", "Batch18200705", "Batch18200708", "Batch18200709", 
    "Batch18200710", "Batch18200711", "Batch20200712", "Batch20200715", "Batch21200701", 
    "Batch21200703", "Batch21200704", "Batch21200705", "Batch21200706", "Batch21200708", 
    "Batch21200709", "Batch22200630", "Batch22200701", "Batch22200702", "Batch22200707", 
    "Batch23200620", "Batch23200701", "Batch23200702", "Batch23200703", "Batch23200704", 
    "Batch23200706", "Batch24200717", "Batch25200707", "Batch54200711", "Batch55200705", 
    "Batch55200706", "Batch55200707", "Batch56200701", "Batch56200702", "Batch56200704", 
    "Batch56200705", "Batch56200709", "Batch56200710", "Batch57200701", "Batch57200702", 
    "Batch57200703", "Batch57200704", "Batch57200706", "Batch57200708", "Batch57200709", 
    "Batch57200710", "Batch57200711", "Batch57200712", "Batch57200714", "Batch57200717", 
    "Batch58200701", "Batch58200702", "Batch58200703", "Batch58200704", "Batch58200705", 
    "Batch58200708", "Batch58200710", "Batch58200712", "Batch58200713", "Batch59200622", 
    "Batch59200701", "Batch59200702", "Batch59200704", "Batch59200705", "Batch59200706", 
    "Batch59200707", "Batch59200708", "Batch59200709", "Batch60200618", "Batch60200702", 
    "Batch60200705", "Batch60200708"), class = "factor"), SetValue = c(690, 
    690, 690, 690, 690, 690, 690, 690, 690, 690), ActualValue = c(705, 
    706, 706, 705, 705, 704, 704, 704, 705, 705), ONCondition = c(TRUE, 
    TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE)), row.names = c(NA, 
    10L), class = "data.frame")

> data
     BatcBatchNo SetValue ActualValue ONCondition
1  Batch18200711      690         705        TRUE
2  Batch18200711      690         706        TRUE
3  Batch18200711      690         706        TRUE
4  Batch18200711      690         705        TRUE
5  Batch18200711      690         705        TRUE
6  Batch18200711      690         704        TRUE
7  Batch18200711      690         704        TRUE
8  Batch18200711      690         704        TRUE
9  Batch18200711      690         705        TRUE
10 Batch18200711      690         705        TRUE

i need to calculate standard deviation for each Batch & Set Value. But before calculating its standard deviation i need to remove the outliers in that Batch.

Means i need to perform following steps

  1. Remove Outliers in Actual Value within each batch. Outliers to be calculated on batch to batch basis and not on overall dataset
  2. Perform standard deviation on batch n Set value combined.

I was trying to use dplyr functions to calculation the standard deviation but then its not taking care of the outliers.

This code doest take care of outliers

Output= Data%>%
   group_by(BatchNo)%>%
     group_by(SetValue)%>%
      summarize(Mean= mean(ActualValue),SD= sd(ActualValue))

How do I proceed in this case.


Solution

  • you can use filter to remove "outliers` accroding to the logic mentioned in the comments:

    Data%>%
      group_by(BatchNo) %>%
      filter(ActualValue <= quantile(ActualValue, 0.99), ActualValue >= quantile(ActualValue, 0.01)) %>%
      group_by(BatchNo, SetValue) %>%
      summarize(Mean = mean(ActualValue), SD = sd(ActualValue))