summarize across -- is it order dependent?


I came across something weird with dplyr and across, or at least something I do not understand.

If we use the across function to compute the mean and standard error of the mean across multiple columns, I am tempted to use the following command:

mtcars %>% group_by(gear) %>% select(mpg, cyl) %>%
  summarize(across(everything(), ~mean(.x, na.rm = TRUE), .names = "{col}"),
            across(everything(), ~sd(.x, na.rm = TRUE)/sqrt(sum(!is.na(.x))), .names = "se_{col}")) %>% head()

Which results in

   gear   mpg   cyl se_mpg se_cyl
  <dbl> <dbl> <dbl>  <dbl>  <dbl>
1     3  16.1  7.47     NA     NA
2     4  24.5  4.67     NA     NA
3     5  21.4  6        NA     NA

However, if I switch the order of the individual across commands, I get the following:

mtcars %>% group_by(gear) %>% select(mpg, cyl) %>%
  summarize(across(everything(), ~sd(.x, na.rm = TRUE)/sqrt(sum(!is.na(.x))), .names = "se_{col}"),
            across(everything(), ~mean(.x, na.rm = TRUE), .names = "{col}")) %>% head()

# A tibble: 3 x 5
   gear se_mpg se_cyl   mpg   cyl
  <dbl>  <dbl>  <dbl> <dbl> <dbl>
1     3  0.871  0.307  16.1  7.47
2     4  1.52   0.284  24.5  4.67
3     5  2.98   0.894  21.4  6   

Why is this the case? Does it have something to do with my usage of everything()? In my situation I'd like the mean and the standard error of the mean calculated across every variable in my dataset.


Solution

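  • summarize() evaluates its arguments in order, and each expression sees the columns produced by the ones before it. In your first version the first across() overwrites mpg and cyl with their group means, so the second across() calls sd() on a single value per group, which returns NA. In the second version the standard errors are computed from the raw columns before the means overwrite them. To avoid depending on the order, I suggest writing a single across() call with a named list of lambda functions, as described in the across documentation.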

    This way it doesn't matter whether the mean or the standard error is listed first: you get no NAs either way.

    mtcars %>% 
      group_by(gear) %>% 
      select(mpg, cyl) %>% 
      summarize(across(everything(), list(
        mean = ~mean(.x, na.rm = TRUE),
        se = ~sd(.x, na.rm = TRUE)/sqrt(sum(!is.na(.x)))
      ), .names = "{fn}_{col}"))
    
    # A tibble: 3 x 5
    #    gear mean_mpg se_mpg mean_cyl se_cyl
    #   <dbl>    <dbl>  <dbl>    <dbl>  <dbl>
    # 1     3     16.1  0.871     7.47  0.307
    # 2     4     24.5  1.52      4.67  0.284
    # 3     5     21.4  2.98      6     0.894
    
    
    
    mtcars %>% 
      group_by(gear) %>% 
      select(mpg, cyl) %>% 
      summarize(across(everything(), list(
        se = ~sd(.x, na.rm = TRUE)/sqrt(sum(!is.na(.x))),
        mean = ~mean(.x, na.rm = TRUE)
      ), .names = "{fn}_{col}"))
    
    # A tibble: 3 x 5
    #    gear se_mpg mean_mpg se_cyl mean_cyl
    #   <dbl>  <dbl>    <dbl>  <dbl>    <dbl>
    # 1     3  0.871     16.1  0.307     7.47
    # 2     4  1.52      24.5  0.284     4.67
    # 3     5  2.98      21.4  0.894     6
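
The order dependence in the question can be reproduced without across() at all, which suggests it comes from summarize() evaluating its expressions one after another. The sketch below first shows that effect on a single column, then shows one way to keep two separate across() calls: give the means new names so the raw columns are not overwritten. The helper `se` and the object names `seq_demo` and `two_calls` are made up for this illustration, and the columns are selected explicitly with c(mpg, cyl) rather than everything().

```r
library(dplyr)

# Plain summarize(), no across(): `mpg` is replaced by its group mean first,
# so the later expressions see one value per group instead of the raw data.
seq_demo <- mtcars %>%
  group_by(gear) %>%
  summarize(mpg     = mean(mpg),
            n_seen  = length(mpg),  # 1 in every group, not the group size
            sd_seen = sd(mpg))      # sd() of a single value is NA

# Hypothetical helper: standard error of the mean, ignoring NAs.
se <- function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))

# Two across() calls that do not clash: the means are written to mean_{col},
# so mpg and cyl are still the raw columns when the second call runs.
two_calls <- mtcars %>%
  group_by(gear) %>%
  summarize(across(c(mpg, cyl), ~ mean(.x, na.rm = TRUE), .names = "mean_{col}"),
            across(c(mpg, cyl), ~ se(.x), .names = "se_{col}"))

print(seq_demo)
print(two_calls)
```

With distinct output names in both calls, swapping the two across() lines should change only the column order, not the values.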