Summarize a factor variable by combinations of a second factor variable


My data look like this

set.seed(89)
d <- data.frame(
  ID=seq(1, 100),
  Encounter=sample(c(1:50), 100, replace = TRUE), 
  EffortType=sample(c("A","B","C"), 100, replace = TRUE)
)

I consider the Encounter variable as a factor.

For each Encounter, I would like to know which combination of EffortType values occurs in it, and then how many Encounters have each combination.

I would like the results to look something like this

EffortType      N
A               8
B               8
C               9
A,B             4
A,C             8
B,C             5
A,B,C           3

I would also like to then be able to subset the data by the EffortType combinations. For example, I would end up with a subset for EffortType A,B that looks something like this

ID  Encounter    EffortType    
52  2            A
53  2            B
61  2            A
63  2            A
79  2            A
36  7            B
59  7            B
83  7            A
etc.

I did try to reshape the data such that I had separate variables for each level of EffortType using "mutate", and then tried to count up the instances of each combination, but am not doing this correctly as shown below. I can't figure out how to "group" by encounter before doing the counting.

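library(dplyr)
library(data.table)

# one logical indicator column per EffortType level
d = mutate(d, 
              A = grepl("A", EffortType),
              B = grepl("B", EffortType),
              C = grepl("C", EffortType))

# this counts rows per Encounter/indicator combination, not Encounters per combination
d = data.table(d)
d[, .N, by = c('Encounter', 'A', 'B', 'C')]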

But I don't end up with the summary I'm hoping for. Please help. Thx.


Solution

  • I would make a separate table for encounter attributes:

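    library(data.table)
    setDT(d)  # convert d to a data.table by reference, in case it is still a data.frame

    # one row per encounter, labelled with its sorted set of effort types
    EncounterDT = d[, 
      .(tt = paste(sort(unique(EffortType)), collapse=" "))
    , keyby=Encounter]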
    
    # count encounters by types
    EncounterDT[, .N, keyby=tt][order(nchar(tt), tt)]
    
    # subset d using a join
    d[EncounterDT[tt == "A B", .(Encounter)], on=.(Encounter)]
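
To make the separate-table idea easier to check by hand, here is a minimal sketch on a tiny made-up dataset. The names `toy`, `toyEnc`, `toyCounts` and `toyAB` are hypothetical and not part of the question's data; the calls are the same ones used above.

```r
library(data.table)

# Toy data (an assumption for illustration, not the question's d)
toy <- data.table(
  ID         = 1:6,
  Encounter  = c(1, 1, 2, 2, 2, 3),
  EffortType = c("A", "B", "B", "A", "A", "C")
)

# One row per encounter, labelled with its sorted set of effort types.
# Encounters 1 and 2 both contain A and B, so both get "A B";
# encounter 3 contains only C.
toyEnc <- toy[, .(tt = paste(sort(unique(EffortType)), collapse = " ")),
              keyby = Encounter]

# Number of encounters per combination: "A B" occurs twice, "C" once
toyCounts <- toyEnc[, .N, keyby = tt]

# Rows of toy that belong to an "A B" encounter (IDs 1 to 5)
toyAB <- toy[toyEnc[tt == "A B", .(Encounter)], on = .(Encounter)]

print(toyEnc)
print(toyCounts)
print(toyAB)
```

Note that duplicated effort types within an encounter (the two A rows in encounter 2) do not change the label, because `unique()` is applied before pasting.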
    

    If you have a strong preference for using a single table, though...

    # add a repeating-value column
    d[, tt := paste(sort(unique(EffortType)), collapse=" "), by=Encounter]
    
    # count encounters by types
    d[, .(N = uniqueN(Encounter)), keyby=tt][order(nchar(tt), tt)]
    
    # subset d using the tt column
    d[tt == "A B"]
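
If a subset is wanted for every combination at once, one possible extension of the single-table approach is data.table's `split()` method with its `by` argument. This is only a sketch: it rebuilds `d` from the question so that it runs on its own, and the name `subsets` is hypothetical. Which combinations actually appear depends on the random draw, so a given label such as "A B" is not guaranteed to be present.

```r
library(data.table)

# Rebuild the question's data as a data.table
set.seed(89)
d <- data.table(
  ID         = seq(1, 100),
  Encounter  = sample(1:50, 100, replace = TRUE),
  EffortType = sample(c("A", "B", "C"), 100, replace = TRUE)
)

# Label every row with the effort-type combination of its encounter
d[, tt := paste(sort(unique(EffortType)), collapse = " "), by = Encounter]

# One data.table per combination, in a list named by the tt values
subsets <- split(d, by = "tt")

names(subsets)      # the combinations that occur in this draw
subsets[["A B"]]    # rows from "A B" encounters, or NULL if none occur
```

Each row of `d` lands in exactly one element of `subsets`, so the pieces together have as many rows as `d`.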