Tags: r, dplyr, data-wrangling, summarize

Why does the order of functions within summarise() affect its output?


When I use two functions within dplyr::summarise(), the ordering of the functions affects the output. While this post shows this can happen when the first function affects the columns the second function operates on (suggesting each function is processed sequentially), this isn't the case in my example.

In the example below, the data have 2 rows per year. When I count the rows per year (using n() within summarise()) after counting the missing values for each variable, I get the correct result of 2 rows per year. However, if I count the rows before counting the missing values, the output shows 0 rows per year.

Why does this produce two different results? And if it is indeed because of some sequential processing I'm overlooking, is it then possible to use the output of one of the functions used within a single summarise() call as an input to another function used within the same call?

library(dplyr)   

# Example data with 2 rows per year
df <- data.frame(var1 = rep(c(NA,NA,5),2),
                 var2 = rep(c(1,NA,2),2),
                 year = rep(1:3, 2))

# Approaches to counting number of missing values AND total rows for each year:

# Approach 1: calculate rows per group second, CORRECTLY shows 2 rows per year 
df %>%
  group_by(year) %>%
  summarise(across(everything(), ~ sum(is.na(.x))),
            rows_per_year = n())

#>    year  var1  var2 rows_per_year
#>   <int> <int> <int>         <int>
#> 1     1     2     0             2
#> 2     2     2     2             2
#> 3     3     0     0             2

# Approach 2: calculate rows per group first, INCORRECTLY shows 0 rows per year
df %>%
  group_by(year) %>%
  summarise(rows_per_year = n(), 
            across(everything(), ~ sum(is.na(.x))))

#>    year rows_per_year  var1  var2
#>   <int>         <int> <int> <int>
#> 1     1             0     2     0
#> 2     2             0     2     2
#> 3     3             0     0     0

Solution

  • In Approach 2, everything() also selects "rows_per_year", because that column was created earlier in the same summarise() call. across() therefore overwrites it with sum(is.na(rows_per_year)), which is 0 because the result of n() is never NA.

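    A simpler example of the same behaviour, where the second expression overwrites the result of the first:

    df %>%
      group_by(year) %>%
      # n() gives 2 per year; the second expression then replaces it with 2 + 1 = 3
      summarise(rows_per_year = n(),
                rows_per_year = rows_per_year + 1)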
    

    So, if I've understood your final question correctly, the answer is "yes": an expression in summarise() can use the output of an earlier expression in the same call, and that is what Approach 2 does, presumably by accident.
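A minimal sketch of two ways to act on this, reusing the df from the question. The first keeps n() in front but excludes the new column from across() with `!rows_per_year`; the second reuses an earlier result on purpose. The names fixed, reused, var1_missing and var1_share_missing are illustrative and not part of the original post.

```r
library(dplyr)

# Same example data as in the question: 2 rows per year
df <- data.frame(var1 = rep(c(NA, NA, 5), 2),
                 var2 = rep(c(1, NA, 2), 2),
                 year = rep(1:3, 2))

# Option A: count rows first, then exclude that new column from across()
# so it is not overwritten by sum(is.na(.x))
fixed <- df %>%
  group_by(year) %>%
  summarise(rows_per_year = n(),
            across(!rows_per_year, ~ sum(is.na(.x))))

# Option B: deliberately use earlier results in a later expression
# (var1_missing and var1_share_missing are hypothetical column names)
reused <- df %>%
  group_by(year) %>%
  summarise(rows_per_year = n(),
            var1_missing = sum(is.na(var1)),
            var1_share_missing = var1_missing / rows_per_year)

print(fixed)
print(reused)
```

With Option A, rows_per_year should stay at 2 for every year regardless of where it appears in the call. Naming the columns explicitly, for example across(c(var1, var2), ...), is another way to get the same effect.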