Tags: r, dplyr, tidyr, levels, tally

Include empty factor levels in tally with tidyr and dplyr


A question as I learn dplyr and its ilk.

I am calculating a tally and a relative frequency of a factor conditioned on two other variables in a df. For instance:

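library(dplyr)
library(tidyr)
set.seed(3457)
pct <- function(x) {x/sum(x)}
# stringsAsFactors = TRUE keeps y and z as factors under R >= 4.0.0
foo <- data.frame(x = rep(1:3, 20),
                  y = rep(rep(c("a","b"),each=3),10),
                  z = LETTERS[floor(runif(60, 1,5))],
                  stringsAsFactors = TRUE)
# tally each (x, y, z) combination, then take the percentage within each (x, y)
bar <- foo %>%
  group_by(x, y, z) %>%
  tally() %>%
  mutate(freq = (n / sum(n)) * 100)
head(bar)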

I'd like the output, bar, to include all the levels of foo$z. I.e., there are no cases of C here:

subset(bar, x==2 & y=="a")   

How can I have bar tally the missing levels so I get:

subset(bar, x==2 & y=="a",select = n) 

to return 4, 5, 0, 1 (and select = freq to give 40, 50, 0, 10)?

Many thanks.

Edit: Ran with the seed set!


Solution

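  • We can use complete() from tidyr to add a row for every level of z within each observed (x, y) pair, filling n and freq with 0 where a level is missing: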

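    bar1 <- bar %>%
               # drop the x, y grouping left by tally() so complete() sees the whole table
               ungroup() %>%
               complete(z, nesting(x, y), fill = list(n = 0, freq = 0)) %>%
               # restore the original column order (select_() is deprecated)
               select(all_of(names(bar)))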
    filter(bar1, x==2 & y=="a")   
    #      x      y      z     n  freq
    #   <int> <fctr> <fctr> <dbl> <dbl>
    #1     2      a      A     4    40
    #2     2      a      B     5    50
    #3     2      a      C     0     0
    #4     2      a      D     1    10
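
The same idea can be sketched on a small, deterministic data frame, which makes the result easier to check by hand. Here `toy` and its values are made up for illustration and are not taken from the question. As a variation on the answer, this sketch assumes it is acceptable to complete the counts first and compute `freq` afterwards, so only `n` needs a fill value.

    library(dplyr)
    library(tidyr)

    # Hypothetical toy data: level "C" of z never occurs
    toy <- data.frame(
      x = c(1, 1, 1, 2, 2),
      y = c("a", "a", "a", "b", "b"),
      z = factor(c("A", "B", "B", "A", "A"), levels = c("A", "B", "C"))
    )

    res <- toy %>%
      count(x, y, z) %>%                                  # ungrouped tally of observed combinations
      complete(nesting(x, y), z, fill = list(n = 0)) %>%  # add missing z levels per observed (x, y)
      group_by(x, y) %>%
      mutate(freq = n / sum(n) * 100) %>%                 # percentage within each (x, y)
      ungroup()

    res

nesting(x, y) restricts the expansion to the (x, y) pairs that actually occur, while z, being a factor, is expanded to all of its levels. If z were a character column, only the values present in the data would be used. Depending on the dplyr version, count(x, y, z, .drop = FALSE) may be another way to keep empty factor levels without calling complete().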