Sample data:
tibble::tibble(id = c("101", "101", "101", "102", "102", "103"),
color = c("Blue", "Blue", "Red", "Red", "Green", "Green"))
#> # A tibble: 6 × 2
#> id color
#> <chr> <chr>
#> 1 101 Blue
#> 2 101 Blue
#> 3 101 Red
#> 4 102 Red
#> 5 102 Green
#> 6 103 Green
Created on 2024-09-05 with reprex v2.1.1
I want to count the number of id
's which have a certain value of color
, for every value of color
.
Sample Output:
tibble::tibble(color = c("Blue", "Red", "Green"), number_of_ids_having_color = c("1",
"2", "2"))
#> # A tibble: 3 × 2
#> color number_of_ids_having_color
#> <chr> <chr>
#> 1 Blue 1
#> 2 Red 2
#> 3 Green 2
Created on 2024-09-05 with reprex v2.1.1
This is sort of similar to this question, the differences being that (a) my variable to condition on is categorical and (b) I want to make this calculation for every single value of said variable.
As a little bonus question (lower priority), let's say I also have another variable, most_important_color
and I'd like to also count the number of id
's having each color as their most_important_color
:
tibble::tibble(id = c("101", "101", "101", "102", "102", "103"),
color = c("Blue", "Blue", "Red", "Red", "Green", "Green"),
most_important_color = c("F", "F", "T", "T", "F", "T"))
#> # A tibble: 6 × 3
#> id color most_important_color
#> <chr> <chr> <chr>
#> 1 101 Blue F
#> 2 101 Blue F
#> 3 101 Red T
#> 4 102 Red T
#> 5 102 Green F
#> 6 103 Green T
Created on 2024-09-05 with reprex v2.1.1
The final end product, incorporating the solution to my first (and more important question) would look like this:
tibble::tibble(color_group = c("Blue", "Red", "Green"), number_of_ids_having_color = c("1",
"2", "2"), number_of_ids_having_color_as_most_important = c("0",
"2", "1"))
#> # A tibble: 3 × 3
#> color_group number_of_ids_having_color number_of_ids_having_color_as_most_im…¹
#> <chr> <chr> <chr>
#> 1 Blue 1 0
#> 2 Red 2 2
#> 3 Green 2 1
#> # ℹ abbreviated name: ¹number_of_ids_having_color_as_most_important
Created on 2024-09-05 with reprex v2.1.1
It's fairly straightforward to do this separately and then join the results of that operation to that of above, but I'm interested if there's a way to do all of this in a single pipe without any need to create intermediate objects and join them.
Tidyverse solutions strongly preferred since I share this code with people who are mostly comfortable with tidyverse.
Thank you!
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df_colors <- tibble::tibble(
id = c("101", "101", "101", "102", "102", "103"),
color = c("Blue", "Blue", "Red", "Red", "Green", "Green"),
most_important_color = c(F, F, T, T, F, T)
)
df_colors |>
group_by(color) %>%
summarise(
ids = n_distinct(id),
most_important = n_distinct(if_else(most_important_color, id, NA_character_), na.rm = T),
records = n()
)
#> # A tibble: 3 × 4
#> color ids most_important records
#> <chr> <int> <int> <int>
#> 1 Blue 1 0 2
#> 2 Green 2 1 2
#> 3 Red 2 2 2
Created on 2024-09-05 with reprex v2.1.1