Search code examples
rdplyrtidyverse

R: Counting the number of groups having each value of a variable?


Sample data:

tibble::tibble(id = c("101", "101", "101", "102", "102", "103"), 
    color = c("Blue", "Blue", "Red", "Red", "Green", "Green"))

#> # A tibble: 6 × 2
#>   id    color 
#>   <chr> <chr> 
#> 1 101   Blue  
#> 2 101   Blue  
#> 3 101   Red   
#> 4 102   Red   
#> 5 102   Green 
#> 6 103   Green

Created on 2024-09-05 with reprex v2.1.1

I want to count the number of id's which have a certain value of color, for every value of color.

Sample Output:

tibble::tibble(color = c("Blue", "Red", "Green"), number_of_ids_having_color = c("1", 
    "2", "2"))

#> # A tibble: 3 × 2
#>   color number_of_ids_having_color
#>   <chr>       <chr>                           
#> 1 Blue        1                               
#> 2 Red         2                               
#> 3 Green       2

Created on 2024-09-05 with reprex v2.1.1

This is sort of similar to this question, the differences being that (a) my variable to condition on is categorical and (b) I want to make this calculation for every single value of said variable.


As a little bonus question (lower priority), let's say I also have another variable, most_important_color and I'd like to also count the number of id's having each color as their most_important_color:

tibble::tibble(id = c("101", "101", "101", "102", "102", "103"), 
    color = c("Blue", "Blue", "Red", "Red", "Green", "Green"), 
    most_important_color = c("F", "F", "T", "T", "F", "T"))

#> # A tibble: 6 × 3
#>   id    color most_important_color
#>   <chr> <chr>       <chr>               
#> 1 101   Blue        F                   
#> 2 101   Blue        F                   
#> 3 101   Red         T                   
#> 4 102   Red         T                   
#> 5 102   Green       F                   
#> 6 103   Green       T

Created on 2024-09-05 with reprex v2.1.1

The final end product, incorporating the solution to my first (and more important question) would look like this:

tibble::tibble(color_group = c("Blue", "Red", "Green"), number_of_ids_having_color = c("1", 
    "2", "2"), number_of_ids_having_color_as_most_important = c("0", 
    "2", "1"))

#> # A tibble: 3 × 3
#>   color_group number_of_ids_having_color number_of_ids_having_color_as_most_im…¹
#>   <chr>       <chr>                      <chr>                                  
#> 1 Blue        1                          0                                      
#> 2 Red         2                          2                                      
#> 3 Green       2                          1                                      
#> # ℹ abbreviated name: ¹​number_of_ids_having_color_as_most_important

Created on 2024-09-05 with reprex v2.1.1

It's fairly straightforward to do this separately and then join the results of that operation to that of above, but I'm interested if there's a way to do all of this in a single pipe without any need to create intermediate objects and join them.

Tidyverse solutions strongly preferred since I share this code with people who are mostly comfortable with tidyverse.

Thank you!


Solution

  • library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    df_colors <- tibble::tibble(
      id = c("101", "101", "101", "102", "102", "103"), 
      color = c("Blue", "Blue", "Red", "Red", "Green", "Green"),
      most_important_color = c(F, F, T, T, F, T)
    )
    
    df_colors |>
      group_by(color) %>% 
      summarise(
        ids = n_distinct(id),
        most_important = n_distinct(if_else(most_important_color, id, NA_character_), na.rm = T),
        records = n()
      )
    #> # A tibble: 3 × 4
    #>   color   ids most_important records
    #>   <chr> <int>          <int>   <int>
    #> 1 Blue      1              0       2
    #> 2 Green     2              1       2
    #> 3 Red       2              2       2
    

    Created on 2024-09-05 with reprex v2.1.1