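I am a beginner and would like to compute adjusted boxplot statistics (adjboxStats) separately for every code and then remove the outliers found for each code.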
Some dummy data:
data <- data.frame(
code=c("A1","A2","A3","A1","A2","A3","A1","A2","A3","A1","A2","A3","A1","A2","A3","A1","A2","A3","A2","A3","A1","A2","A3","A1","A2"),
duration=c(100,100,100,200,200,200,23523,213123,12,23213,968,37253,573012,472662,3846516,233,262,5737,3038,2,5,123,969,6,40582)
)
At the moment, I can produce the results across all codes (see below). But I have problems i) running the statistics for every code (would group_by(code) work?) and ii) excluding the outliers found ($out) for every code.
library(robustbase)
adjboxStats(data$duration, coef = 1.5, a = -4, b = 3, do.conf = TRUE, do.out = TRUE)
$stats
[1] 2 100 262 23523 573012
$n
[1] 50
$conf
[1] -4971.77 5495.77
$fence
[1] -571.2153 707257.8400
$out
[1] 3846516 3846516
Thank you very much in advance for your help!
We can group by 'code' and summarise, wrapping the output of adjboxStats in a list so that each group gets one list-column entry
library(dplyr)
library(robustbase)
data1 <- data %>%
  group_by(code) %>%
  summarise(out = list(adjboxStats(duration, coef = 1.5,
    a = -4, b = 3, do.conf = TRUE, do.out = TRUE)))
data1
# A tibble: 3 x 2
# code out
# <chr> <list>
#1 A1 <named list [5]>
#2 A2 <named list [5]>
#3 A3 <named list [5]>
data1$out[[1]]
#$stats
#[1] 5.0 53.0 216.5 23368.0 573012.0
#$n
#[1] 8
#$conf
#[1] -12807.59 13240.59
#$fence
#[1] -624.4143 696935.1967
#$out
#numeric(0)
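If only the outliers per code are needed, a base R route is also possible. The following is a minimal sketch, not part of the original answer: it assumes the same dummy data as in the question and uses split() and lapply() in place of group_by() and summarise(). The name `outs` is just an illustrative choice.

```r
library(robustbase)

# Same dummy data as in the question
data <- data.frame(
  code = c("A1","A2","A3","A1","A2","A3","A1","A2","A3","A1","A2","A3","A1",
           "A2","A3","A1","A2","A3","A2","A3","A1","A2","A3","A1","A2"),
  duration = c(100,100,100,200,200,200,23523,213123,12,23213,968,37253,
               573012,472662,3846516,233,262,5737,3038,2,5,123,969,6,40582)
)

# Split the durations by code, run adjboxStats on each piece,
# and keep only the $out component (the values beyond the fences)
outs <- lapply(split(data$duration, data$code), function(x) {
  adjboxStats(x, coef = 1.5, a = -4, b = 3,
              do.conf = TRUE, do.out = TRUE)$out
})

# Named list with one numeric vector of outliers per code
outs

# Number of outliers found per code
lengths(outs)
```

The result is a plain named list, so `outs[["A1"]]` gives the outliers for code A1 without having to index into a list column.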
If we are interested in filtering out the outliers, then use %in% with ! after extracting the 'out' component
data %>%
  group_by(code) %>%
  filter(!duration %in% adjboxStats(duration, coef = 1.5,
    a = -4, b = 3, do.conf = TRUE, do.out = TRUE)$out)
# A tibble: 24 x 2
# Groups: code [3]
# code duration
# <chr> <dbl>
# 1 A1 100
# 2 A2 100
# 3 A3 100
# 4 A1 200
# 5 A2 200
# 6 A3 200
# 7 A1 23523
# 8 A2 213123
# 9 A3 12
#10 A1 23213
# … with 14 more rows
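It can be useful to inspect the outliers before dropping them. As a hedged variation on the filter above (a sketch, not the only way to do it), the same %in% test can go into mutate() to create a logical flag per row. The column name `is_outlier` and the object names `flagged` and `cleaned` are made up for illustration, and ungroup() is added so that later steps do not run per group by accident.

```r
library(dplyr)
library(robustbase)

# Same dummy data as in the question
data <- data.frame(
  code = c("A1","A2","A3","A1","A2","A3","A1","A2","A3","A1","A2","A3","A1",
           "A2","A3","A1","A2","A3","A2","A3","A1","A2","A3","A1","A2"),
  duration = c(100,100,100,200,200,200,23523,213123,12,23213,968,37253,
               573012,472662,3846516,233,262,5737,3038,2,5,123,969,6,40582)
)

# Flag, within each code, the rows whose duration is among that code's outliers
flagged <- data %>%
  group_by(code) %>%
  mutate(is_outlier = duration %in% adjboxStats(duration, coef = 1.5,
    a = -4, b = 3, do.conf = TRUE, do.out = TRUE)$out) %>%
  ungroup()

# How many rows are flagged per code
count(flagged, code, is_outlier)

# Drop the flagged rows once they have been checked
cleaned <- flagged %>%
  filter(!is_outlier) %>%
  select(-is_outlier)
```

Keeping the flag as a column leaves all rows in place, so the outliers can still be listed with `filter(flagged, is_outlier)` before deciding to remove them.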
The data used above:
data <- structure(list(code = c("A1", "A2", "A3", "A1", "A2", "A3", "A1",
"A2", "A3", "A1", "A2", "A3", "A1", "A2", "A3", "A1", "A2", "A3",
"A2", "A3", "A1", "A2", "A3", "A1", "A2"), duration = c(100,
100, 100, 200, 200, 200, 23523, 213123, 12, 23213, 968, 37253,
573012, 472662, 3846516, 233, 262, 5737, 3038, 2, 5, 123, 969,
6, 40582)), class = "data.frame", row.names = c(NA, -25L))