
R boxplot using apply() for saving outliers from boxplot() groupby


I want to reshape the data, or use apply(), to find and save the outliers reported by the boxplot() function, grouped by a group identifier.

My first attempt is a function that calls boxplot() and captures the outliers, e.g., boxplot(...)$out, then returns them so the result can be stored in df.events$outliers. The end goal is a table of outliers by group:

e.g., OutliersByGroupTableName
group_id_name
outliers_from_boxplot

Then the outliers from a boxplot() combined with a select() over a date range of events could be added as a new column, to form the following table.

e.g., OutliersByGroupTableName
group_id_name
outliers_from_boxplot
time_range_outliers_from_boxplot

With the code below, my attempt was to wrap boxplot() in a function, then use apply() to iterate over "group" and "rank", calling FUN=test_func(df.events) with the data frame. This is where I am stuck: how to use apply() to pass the data to boxplot() and write the result back to a table column (not shown in this code). Alternatively, is apply() even the best approach for this?

test_func <- function(df) {
  boxplot(df$rank ~ df$group, data=df, plot=FALSE, )$out
}
apply(df.events, c("group","rank"), FUN=test_func(df.events))

Data (dput)

> dput(head(df.events, 50))
structure(list(rank = c(0.5, 0.5, 0.5, 0.5, 0, 1, 1, 1, 1, 0, 
0, 0, 0.25, 0.25, 0, 2, 2, 2, 0, 0, 2, 2, 0, 1, 1, 0, 0, 0, 0, 
0.25, 0.25, 0.6, 0.6, 0, 0, 3, 3, 0.5, 0.5, 0.5, 3, 3, 3, 1.5, 
1, 1, 0, 1, 1, 0), group = c(751, 728, 753, 808, 909, 909, 920, 
728, 686, 727, 1025, 727, 728, 808, 750, 752, 752, 782, 752, 
686, 752, 808, 691, 920, 920, 727, 727, 782, 991, 727, 808, 
686, 728, 1025, 686, 920, 986, 782, 736, 909, 686, 782, 751, 
728, 782, 782, 909, 909, 686, 686), outliers = c("NA", "NA", 
"NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", 
"NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", 
"NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", 
"NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", 
"NA", "NA", "NA", "NA")), row.names = c(NA, -50L), class = c("tbl_df", 
"tbl", "data.frame"))
> 

Solution

  • If the names of the rank column and the group column need to be passed dynamically, make them arguments of the function along with the dataset. The formula can then be built with paste0() and passed to boxplot():

    test_func <- function(df, colnm, grpcol) {
      boxplot(as.formula(paste0(colnm, ' ~ ', grpcol)), data = df, plot = FALSE)
    }
    

    The function can then be called as:

    out <- test_func(df.events, 'rank', 'group')
    str(out)
    #List of 6
    # $ stats: num [1:5, 1:16] 0 0 0.6 1 1 0 0 0 0 0 ...
    # $ n    : num [1:16] 7 1 5 5 1 1 2 4 1 6 ...
    # $ conf : num [1:2, 1:16] 0.00282 1.19718 0 0 0 ...
    # $ out  : num [1:2] 3 0.25
    # $ group: num [1:2] 1 3
    # $ names: chr [1:16] "686" "691" "727" "728" ..
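
To get from this list to the table of outliers by group that the question asks for, one possible sketch follows. It relies on the fact that, in the value returned by boxplot(), `group` holds the index of the box each outlier belongs to and `names` holds the box labels. The data frame here is only a subset of the question's data (the rows for groups 686 and 727), and the column names `group_id_name` and `outliers_from_boxplot` are taken from the question's target table. The tapply() call with boxplot.stats() at the end is an alternative not used in the answer above, shown as one way to avoid apply() altogether.

```r
# Subset of the question's data: the rank values for groups 686 and 727 only.
df.events <- data.frame(
  rank  = c(1, 0, 0.6, 0, 3, 1, 0,  0, 0, 0, 0, 0.25),
  group = c(rep(686, 7), rep(727, 5))
)

# Same function as in the answer: build the formula from column names.
test_func <- function(df, colnm, grpcol) {
  boxplot(as.formula(paste0(colnm, " ~ ", grpcol)), data = df, plot = FALSE)
}

out <- test_func(df.events, "rank", "group")

# out$group gives, for each outlier, the index of its box;
# indexing out$names with it recovers the group label.
outliers_by_group <- data.frame(
  group_id_name         = out$names[out$group],
  outliers_from_boxplot = out$out
)
print(outliers_by_group)

# Alternative sketch: compute the outliers per group directly,
# which returns one vector of outliers for each group.
outliers_list <- tapply(df.events$rank, df.events$group,
                        function(x) boxplot.stats(x)$out)
print(outliers_list)
```

Groups without outliers simply contribute no rows to `outliers_by_group`, while `outliers_list` keeps an empty numeric vector for them.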