Search code examples
rdataframereshape

Count occurrences of factors across multiple columns in grouped dataframe


I have the following dataframe and want to group by the grp column to see how many of each column-value appears in each group.

> data.frame(grp = unlist(strsplit("aabbccca", "")), col1=unlist(strsplit("ABAABBAB", "")), col2=unlist(strsplit("BBCCCCDD", "")))
  grp col1 col2
1   a    A    B
2   a    B    B
3   b    A    C
4   b    A    C
5   c    B    C
6   c    B    C
7   c    A    D
8   a    B    D

Desired result:

  grp col1A col1B col2B col2C col2D
1   a    1    2     2     0     1
2   b    2    0     0     2     0
3   c    1    2     0     2     1

If I only look at the grp and col1 columns, it is easy to solve this using table() and when there are only 2 columns, I could merge table(df[c('grp', 'col1')]) with table(df[c('grp', 'col2')]). However, this gets extremely cumbersome as the number of factor columns grows, and is problematic if there are shared values between col1 and col2.

Note that dplyr's count doesn't work, as it looks for unique combinations of the col1 and col2.

I've tried melting and spreading the dataframe using tidyr, without any luck

> pivot_longer(df, c(col1, col2), names_to= "key", values_to = "val") %>% pivot_wider("grp", names_from = c("key", "val"), values_from = 1, values_fn = sum)
Error in `stop_subscript()`:
! Can't subset columns that don't exist.
x Column `grp` doesn't exist.

I can find plenty of solutions that work for the case where I have 1 group column and 1 value column, but I can't figure out how to generalize them to more columns.


Solution

  • You can stack col1 & col2 together, count the number of each combination, and then transform the table to a wide form.

    library(dplyr)
    library(tidyr)
    
    df %>%
      pivot_longer(col1:col2) %>%
      count(grp, name, value) %>%
      pivot_wider(grp, names_from = c(name, value), names_sort = TRUE,
                  values_from = n, values_fill = 0)
    
    # A tibble: 3 x 6
      grp   col1_A col1_B col2_B col2_C col2_D
      <chr>  <int>  <int>  <int>  <int>  <int>
    1 a          1      2      2      0      1
    2 b          2      0      0      2      0
    3 c          1      2      0      2      1
    

    A base solution (Thank @GKi to refine the code):

    table(cbind(df["grp"], col=do.call(paste0, stack(df[-1])[2:1])))
    
       col
    grp col1A col1B col2B col2C col2D
      a     1     2     2     0     1
      b     2     0     0     2     0
      c     1     2     0     2     1