r for-loop dplyr grouping summary

How to correctly use group_by() and summarise() in a For loop in R


I'm trying to calculate some summary information to help me check for outliers in different groups in a dataset. I can get the sort of output I want using dplyr::group_by() and dplyr::summarise() - a dataframe with summary information for each group for a given variable. Something like this:

Sepal.Length_outlier_check <- iris %>%
  dplyr::group_by(Species) %>% 
  dplyr::summarise(min = min(Sepal.Length, na.rm = TRUE),
                   max = max(Sepal.Length, na.rm = TRUE),
                   median = median(Sepal.Length, na.rm = TRUE),
                   MAD = mad(Sepal.Length, na.rm = TRUE),
                   MAD_lowlim = median - (3 * MAD),
                   MAD_highlim = median + (3 * MAD),
                   Outliers_low = any(Sepal.Length < MAD_lowlim, na.rm = TRUE),
                   Outliers_high = any(Sepal.Length > MAD_highlim, na.rm = TRUE)
                   )

Sepal.Length_outlier_check

However, I'd like to be able to put this in a For loop to be able to produce similar summary dataframes for each of the different variables in the dataset. I'm new to using loops, but I was thinking it might need to look something like this:

vars <- list(colnames(iris))

for (i in vars) {

x <- iris %>%
  dplyr::group_by(Species) %>% 
  dplyr::summarise(min = min(i, na.rm = TRUE),
                   max = max(i, na.rm = TRUE),
                   median = median(i, na.rm = TRUE),
                   MAD = mad(i, na.rm = TRUE),
                   MAD_lowlim = median - (3 * MAD),
                   MAD_highlim = median + (3 * MAD),
                   Outliers_low = any(i < MAD_lowlim, na.rm = TRUE),
                   Outliers_high = any(i > MAD_highlim, na.rm = TRUE)
                   )

assign(paste(i, "Outlier_check", sep = "_"), x)

}

I know that doesn't work though because in the summary functions i isn't actually referencing any data. I'm not sure what I need to do to make it work though! I'd be very grateful for your help, or any suggestions for how to accomplish all of this more elegantly.

I'm reluctant to use dplyr::summarise_all() because it outputs one summary table for all the variables, and as the real dataset I'm working on has many variables this summary table would become too large to be able to easily review it.

Thanks.


Solution

  • You could also create these per-variable/species summaries without loops or separate functions, simply by gathering the non-Species columns, grouping, and summarizing:

    library(tidyverse)
    
    # Reshape to long format (one row per Species/variable/value),
    # then summarise each variable within each species.
    iris.summary <- iris %>% 
      gather(variable, value, -Species) %>% 
      group_by(variable, Species) %>% 
      summarize(
        min = min(value, na.rm = TRUE),
        max = max(value, na.rm = TRUE),
        median = median(value, na.rm = TRUE),
        MAD = mad(value, na.rm = TRUE),
        MAD_lowlim = median - (3 * MAD),
        MAD_highlim = median + (3 * MAD),
        Outliers_low = any(value < MAD_lowlim, na.rm = TRUE),
        Outliers_high = any(value > MAD_highlim, na.rm = TRUE)
      )
    
    # Print the summary table
    iris.summary
    
       variable     Species      min   max median   MAD MAD_lowlim MAD_highlim Outliers_low Outliers_high
       <chr>        <fct>      <dbl> <dbl>  <dbl> <dbl>      <dbl>       <dbl> <lgl>        <lgl>        
     1 Petal.Length setosa       1     1.9   1.5  0.148      1.06         1.94 TRUE         FALSE        
     2 Petal.Length versicolor   3     5.1   4.35 0.519      2.79         5.91 FALSE        FALSE        
     3 Petal.Length virginica    4.5   6.9   5.55 0.667      3.55         7.55 FALSE        FALSE        
     4 Petal.Width  setosa       0.1   0.6   0.2  0          0.2          0.2  TRUE         TRUE         
     5 Petal.Width  versicolor   1     1.8   1.3  0.222      0.633        1.97 FALSE        FALSE        
     6 Petal.Width  virginica    1.4   2.5   2    0.297      1.11         2.89 FALSE        FALSE        
     7 Sepal.Length setosa       4.3   5.8   5    0.297      4.11         5.89 FALSE        FALSE        
     8 Sepal.Length versicolor   4.9   7     5.9  0.519      4.34         7.46 FALSE        FALSE        
     9 Sepal.Length virginica    4.9   7.9   6.5  0.593      4.72         8.28 FALSE        FALSE        
    10 Sepal.Width  setosa       2.3   4.4   3.4  0.371      2.29         4.51 FALSE        FALSE        
    11 Sepal.Width  versicolor   2     3.4   2.8  0.297      1.91         3.69 FALSE        FALSE        
    12 Sepal.Width  virginica    2.2   3.8   3    0.297      2.11         3.89 FALSE        FALSE
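
If you would still prefer one summary table per variable, the loop from the question can probably be made to work with two changes: loop over a plain character vector of column names (excluding Species), and look each column up with the `.data[[ ]]` pronoun so the string in the loop variable refers to actual data. The following is only a sketch under those assumptions; the names `vars`, `v` and `outlier_checks` are made up for illustration, and the results are kept in a named list rather than created with assign().

```r
library(dplyr)

# Column names to summarise; Species is the grouping variable, so drop it.
vars <- setdiff(colnames(iris), "Species")

# One summary data frame per variable, stored in a named list
# (illustrative alternative to assign()).
outlier_checks <- list()

for (v in vars) {
  outlier_checks[[v]] <- iris %>%
    group_by(Species) %>%
    summarise(
      # .data[[v]] looks up the column whose name is stored in v
      min = min(.data[[v]], na.rm = TRUE),
      max = max(.data[[v]], na.rm = TRUE),
      median = median(.data[[v]], na.rm = TRUE),
      MAD = mad(.data[[v]], na.rm = TRUE),
      MAD_lowlim = median - (3 * MAD),
      MAD_highlim = median + (3 * MAD),
      Outliers_low = any(.data[[v]] < MAD_lowlim, na.rm = TRUE),
      Outliers_high = any(.data[[v]] > MAD_highlim, na.rm = TRUE)
    )
}

# Review one variable at a time
outlier_checks[["Sepal.Length"]]
```

Each element of the list is a three-row table (one row per species), which should stay readable even when the real dataset has many variables.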