r ggplot2 dplyr purrr forcats

summarizing a list of character vectors using forcats and purrr


I have a tibble where col1 is a list of character vectors of variable length and col2 is a numeric vector indicating a group assignment, either 1 or 0. I want to first convert all of the character vectors in the list (col1) to factors, and then unify the levels across these factors so that I can ultimately get a tally of counts for each factor level. For the example data below, the tally would be as follows:

overall:

    level, count  
    "a", 2
    "b", 2
    "c", 2
    "d", 3
    "e", 1

for group=1:

    level, count  
    "a", 1
    "b", 2
    "c", 1
    "d", 1
    "e", 0

for group=0:

    level, count  
    "a", 1
    "b", 0
    "c", 1
    "d", 2
    "e", 1

The ultimate goal is to be able to get a total count of each factor level c("a","b","c","d","e") and plot them by the grouping variable.

Here is some code that might give better context to my problem:

library(forcats)
library(purrr)
library(dplyr)
library(ggplot2)

tib <- tibble(col1=list(c("a","b"),
                 c("b","c","d"), 
                 c("a","d","e"),
                 c("c","d")),
       col2=c(1,1,0,0))


tib %>% 
  mutate(col3=map(.$col1,.f = as_factor)) %>% 
  mutate(col4=map(.$col3,.f = fct_unify))

Unfortunately, this code fails. I get the following error, but don't know why:

Error: `fs` must be a list

I thought my input was a list?

I appreciate any help anyone might offer. Thanks.


Solution
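
On the error itself, a likely reading (assuming a forcats version in which fct_unify() takes a list of factors as its first argument, fs): map() hands fct_unify() one element of col3 at a time, so the function receives a single factor rather than a list. A minimal sketch of calling it once on the whole list column follows; tib2, overall and by_group are names introduced here for illustration only.

```r
library(dplyr)
library(purrr)
library(forcats)

tib <- tibble(
  col1 = list(c("a", "b"), c("b", "c", "d"), c("a", "d", "e"), c("c", "d")),
  col2 = c(1, 1, 0, 0)
)

# Convert each character vector to a factor, then unify the levels
# across the whole list in a single fct_unify() call (not inside map()).
tib2 <- tib %>%
  mutate(col3 = fct_unify(map(col1, as_factor)))

# Every factor in col3 now shares the same levels, so zero counts are kept.
overall  <- table(unlist(tib2$col3))
by_group <- map(split(tib2$col3, tib2$col2), ~ table(unlist(.x)))
```

This keeps the list column intact; the unnest-based approach below is shorter if the factors themselves are not needed.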

  • You can first unnest the list column and then count:

    library(dplyr)
    library(tidyr)
    
    tib %>%
      unnest(cols = col1) %>%
      # If col1 is needed as a factor:
      # mutate(col1 = factor(col1)) %>%
      count(col1)
    
    #  col1      n
    #  <chr> <int>
    #1 a         2
    #2 b         2
    #3 c         2
    #4 d         3
    #5 e         1
    

    To count by group (i.e., col2), convert both columns to factors and use .drop = FALSE so that combinations with a zero count are kept:

    tib %>% 
      unnest(cols = col1) %>%
      mutate_at(vars(col1, col2), factor) %>%
      count(col1, col2, .drop = FALSE)
    
    #   col1  col2      n
    #   <fct> <fct> <int>
    # 1 a     0         1
    # 2 a     1         1
    # 3 b     0         0
    # 4 b     1         2
    # 5 c     0         1
    # 6 c     1         1
    # 7 d     0         2
    # 8 d     1         1
    # 9 e     0         1
    #10 e     1         0
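
Since the stated goal is to plot the counts by the grouping variable, the grouped tally can be passed straight to ggplot2. The following is only a sketch under assumptions: the dodged bar chart, the axis labels, and the names counts and p are illustrative choices, not something specified in the question.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

tib <- tibble(
  col1 = list(c("a", "b"), c("b", "c", "d"), c("a", "d", "e"), c("c", "d")),
  col2 = c(1, 1, 0, 0)
)

# One row per (level, group) pair, including pairs that never occur.
counts <- tib %>%
  unnest(cols = col1) %>%
  mutate(col1 = factor(col1), col2 = factor(col2)) %>%
  count(col1, col2, .drop = FALSE)

# Side-by-side bars: one bar per group within each factor level.
p <- ggplot(counts, aes(x = col1, y = n, fill = col2)) +
  geom_col(position = "dodge") +
  labs(x = "level", y = "count", fill = "group")

print(p)
```

Keeping the zero-count rows means each level reserves a slot for both groups, so the bars stay aligned across levels.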