Search code examples
rdplyrsubsetcountingsummarize

How to use R dplyr's summarize to count the number of rows that match a criteria?


I have a dataset that I want to summarize. First, I want the sum of the home and away games, which I can do. However, I also want to know how many outliers (defined as more than 300 points) are within each subcategory (home, away).

If I wasn't using summarize, I know dplyr has the count() function, but I'd like this solution to appear in my summarize() call. Here's what I have and what I've tried, which fails to perform:

#Test data
library(dplyr)

test <- tibble(score = c(100, 150, 200, 301, 150, 345, 102, 131),
                  location = c("home", "away", "home", "away", "home", "away", "home", "away"),
                  more_than_300 = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE))


#attempt 1, count rows that match a criteria
test %>%
  group_by(location) %>%
  summarize(total_score = sum(score),
            n_outliers = nrow(.[more_than_300 == FALSE]))

Solution

  • You can use sum on logical vectors - it will automatically convert them into numeric values (TRUE being equal to 1 and FALSE being equal to 0), so you need only do:

    test %>%
      group_by(location) %>%
      summarize(total_score = sum(score),
                n_outliers  = sum(more_than_300))
    #> # A tibble: 2 x 3
    #>   location total_score n_outliers
    #>   <chr>          <dbl>      <int>
    #> 1 away             927          2
    #> 2 home             552          0
    

    Or, if these are your only 3 columns, an equivalent would be:

    test %>%
      group_by(location) %>%
      summarize(across(everything(), sum))
    

    In fact, you don't need to make the more_than_300 column - it would suffice to do:

    test %>%
      group_by(location) %>%
      summarize(total_score = sum(score),
                n_outliers  = sum(score > 300))