Search code examples
rdplyrcount

How to summarise by group and total with missing counts using dplyr?


My question is tangental from this previous question: How to summarise by group AND get a summary of the overall dataset using dplyr in R

Using the reprex from the reply to set up the new question:

library(tidyverse)
set.seed(500)
dat <- tibble(
  treatment = sample(c("Group1", "Group2", "Group3"), 100, replace = TRUE),
  recruitment_strategy = sample(c("Strategy 1", "Strategy 2", "Strategy 3", "Strategy 4", "Strategy 5"), 100, replace = TRUE),
  Variable_A = rnorm(100),
  Variable_B = rnorm(100),
  Variable_C = rnorm(100)
)

Now lets make it so that one treatment x strategy combination does not occur in the data

dat2 <- dat %>% 
    filter(!(recruitment_strategy == "Strategy 1" & treatment == "Group1"))

And run the previous solution:

dat2 %>%
    inner_join(
        x = count(., treatment, recruitment_strategy) %>% spread(treatment, n),
        y = count(., recruitment_strategy, name = "Overall_dataset"),
        by = "recruitment_strategy"
    ) %>%
    mutate_at(
        .vars = vars(-recruitment_strategy),
        .funs = ~ str_glue("{.} ({scales::percent(. / sum(.), accuracy = 1)})")
    )

The group which doesn't occur shows up with NA count:

# A tibble: 5 × 5
  recruitment_strategy Group1  Group2   Group3  Overall_dataset
  <chr>                <glue>  <glue>   <glue>  <glue>         
1 Strategy 1           NA (NA) 13 (30%) 4 (16%) 17 (17%)       
2 Strategy 2           8 (NA)  6 (14%)  6 (24%) 20 (20%)       
3 Strategy 3           6 (NA)  12 (27%) 3 (12%) 21 (21%)       
4 Strategy 4           9 (NA)  4 (9%)   5 (20%) 18 (18%)       
5 Strategy 5           6 (NA)  9 (20%)  7 (28%) 22 (22%)       

My question is how can I make it show 0 count instead of NA? I have tried modifying the count() arguments with .drop = FALSE but it does not make a difference

x = count(., treatment, recruitment_strategy, .drop = FALSE) %>% spread(treatment, n),

Any other ideas anyone? Thx


Solution

  • Ok I found my own solution. The trick is to make sure recruitment_strategy is a factor variable before counting with .drop = FALSE also set. Also, I realised spread is a bit outdated so I've changed it to use pivot_wider instead also:

    dat2 %>%
        mutate( recruitment_strategy = as.factor( recruitment_strategy)) %>% 
        inner_join(
            x = count(., treatment, recruitment_strategy, .drop = FALSE) %>% 
                pivot_wider(id_cols = recruitment_strategy,
                            names_from = treatment, 
                            values_from = n),
            y = count(., recruitment_strategy, name = "Overall_dataset"),
            by = "recruitment_strategy"
        ) %>%
        mutate_at(
            .vars = vars(-recruitment_strategy),
            .funs = ~ str_glue("{.} ({scales::percent(. / sum(.), accuracy = 1)})")
        )
    
    # A tibble: 5 × 5
      recruitment_strategy Group1  Group2   Group3  Overall_dataset
      <fct>                <glue>  <glue>   <glue>  <glue>         
    1 Strategy 1           0 (0%)  13 (30%) 4 (16%) 17 (17%)       
    2 Strategy 2           8 (28%) 6 (14%)  6 (24%) 20 (20%)       
    3 Strategy 3           6 (21%) 12 (27%) 3 (12%) 21 (21%)       
    4 Strategy 4           9 (31%) 4 (9%)   5 (20%) 18 (18%)       
    5 Strategy 5           6 (21%) 9 (20%)  7 (28%) 22 (22%)