Tags: r, group-by, mean, boxplot, outliers

Applying the adjusted boxplot method adjboxStats() with group_by in R?


I am a beginner and would like to:

  1. compute adjboxStats() for every code in my data (see below)
  2. remove the outliers for every code

Some dummy data:

data <- data.frame(
code=c("A1","A2","A3","A1","A2","A3","A1","A2","A3","A1","A2","A3","A1","A2","A3","A1","A2","A3","A2","A3","A1","A2","A3","A1","A2"),
duration=c(100,100,100,200,200,200,23523,213123,12,23213,968,37253,573012,472662,3846516,233,262,5737,3038,2,5,123,969,6,40582)
)

At the moment I can produce the results across all codes (see below). However, I have trouble i) running the statistics separately for every code (would group_by(code) work?) and ii) excluding the outliers found ($out) for every code.

library(robustbase)
adjboxStats(data$duration, coef = 1.5, a = -4, b = 3, do.conf = TRUE, do.out = TRUE)
$stats
[1]      2    100    262  23523 573012

$n
[1] 50

$conf
[1] -4971.77  5495.77

$fence
[1]   -571.2153 707257.8400

$out
[1] 3846516 3846516

Thank you very much in advance for your help!


Solution

  • We can group by 'code' and use summarise to store the adjboxStats() result for each group in a list column

    library(dplyr)
    library(robustbase)
    data1 <- data %>%
                group_by(code) %>%
                summarise(out = list(adjboxStats(duration, coef = 1.5,
                     a = -4, b = 3, do.conf = TRUE, do.out = TRUE)))
    
    
    data1
    # A tibble: 3 x 2
    #  code  out             
    #  <chr> <list>          
    #1 A1    <named list [5]>
    #2 A2    <named list [5]>
    #3 A3    <named list [5]>
    
    
    data1$out[[1]]
    #$stats
    #[1]      5.0     53.0    216.5  23368.0 573012.0
    
    #$n
    #[1] 8
    
    #$conf
    #[1] -12807.59  13240.59
    
    #$fence
    #[1]   -624.4143 696935.1967
    
    #$out
    #numeric(0)
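
    A list column can be awkward to inspect. As a minimal sketch, assuming only the fences and the number of outliers per code are of interest, those pieces can be pulled out as ordinary columns. The helper name adj_summary and the column names lower, upper and n_out are made up for illustration, and returning a data frame from summarise assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(robustbase)

# Same dummy data as in the question
data <- data.frame(
  code = c("A1","A2","A3","A1","A2","A3","A1","A2","A3","A1","A2","A3","A1",
           "A2","A3","A1","A2","A3","A2","A3","A1","A2","A3","A1","A2"),
  duration = c(100,100,100,200,200,200,23523,213123,12,23213,968,37253,
               573012,472662,3846516,233,262,5737,3038,2,5,123,969,6,40582)
)

# Hypothetical helper: returns a one-row data frame holding the two fences
# and the number of values that adjboxStats() reports in $out
adj_summary <- function(x) {
  s <- adjboxStats(x, coef = 1.5, a = -4, b = 3)
  data.frame(lower = s$fence[1], upper = s$fence[2], n_out = length(s$out))
}

# One row per code; the data frame returned by the helper is unpacked
# into separate columns (assumes dplyr >= 1.0.0)
fences <- data %>%
  group_by(code) %>%
  summarise(adj_summary(duration))

fences
```

    The full adjboxStats() result is still available through the list-column approach above; this variant only trades completeness for a flat table.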
    

    To filter out the outliers instead, extract the 'out' component within each group and negate %in% with !, so that only rows whose duration is not among that group's outliers are kept

    data %>% 
        group_by(code) %>% 
        filter(!duration %in%  adjboxStats(duration, coef = 1.5,
                      a = -4, b = 3, do.conf = TRUE, do.out = TRUE)$out)
    # A tibble: 24 x 2
    # Groups:   code [3]
    #   code  duration
    #   <chr>    <dbl>
    # 1 A1         100
    # 2 A2         100
    # 3 A3         100
    # 4 A1         200
    # 5 A2         200
    # 6 A3         200
    # 7 A1       23523
    # 8 A2      213123
    # 9 A3          12
    #10 A1       23213
    # … with 14 more rows
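
    For comparison, the same per-code filtering can be sketched in base R, assuming dplyr is not available. The idea is to build one logical outlier flag per row with ave() and then subset; the names is_out and clean are illustrative only.

```r
library(robustbase)

# Same dummy data as in the question
data <- data.frame(
  code = c("A1","A2","A3","A1","A2","A3","A1","A2","A3","A1","A2","A3","A1",
           "A2","A3","A1","A2","A3","A2","A3","A1","A2","A3","A1","A2"),
  duration = c(100,100,100,200,200,200,23523,213123,12,23213,968,37253,
               573012,472662,3846516,233,262,5737,3038,2,5,123,969,6,40582)
)

# Within each code, flag the durations that adjboxStats() reports in $out.
# ave() writes the result back into a numeric vector (0/1), so it is
# converted to logical afterwards.
is_out <- as.logical(ave(
  data$duration, data$code,
  FUN = function(x) x %in% adjboxStats(x, coef = 1.5, a = -4, b = 3)$out
))

# Keep only the rows that are not outliers within their own code
clean <- data[!is_out, ]
nrow(clean)
```

    Because the flag is a plain vector aligned with the rows of data, it can also be stored as a column if the outliers should be marked rather than dropped.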
    

    The data used above:

    data <- structure(list(code = c("A1", "A2", "A3", "A1", "A2", "A3", "A1", 
    "A2", "A3", "A1", "A2", "A3", "A1", "A2", "A3", "A1", "A2", "A3", 
    "A2", "A3", "A1", "A2", "A3", "A1", "A2"), duration = c(100, 
    100, 100, 200, 200, 200, 23523, 213123, 12, 23213, 968, 37253, 
    573012, 472662, 3846516, 233, 262, 5737, 3038, 2, 5, 123, 969, 
    6, 40582)), class = "data.frame", row.names = c(NA, -25L))