Search code examples
rdplyrdatatabledata-manipulation

Counting the Number of Combinations of Unique Values Per ID


id = sample.int(50,1000, replace = TRUE)
years <- c("2010", "2011", "2013", "2014", "2015")
year<- sample(years, 1000, prob=c(0.2,0.2,0.2,0.2, 0.2), replace = TRUE)
my_data =  data.frame(id, year)
my_data <- my_data[order(id),]

For example, patient # 1 visited the hospital twice in 2010, once in 2011, no times in 2012 ... patient # 2 visited the hospital no times in 2010, 5 times in 2011, 3 times in 2012, etc.

For this dataset, I want to find out the number of the number of times each combination of "years" appears (not including duplicates). For example - since there are 5 possible years, there will be 2^5 = 32 possible combinations :

# sample output of the final result
                    combinations frequency
                           2010        11
                           2011         9
                           2012         5
                           2013         1
                           2014        19
                           2015        11
                      2011,2012         9
               2011, 2012, 2013         5
                     2013, 2015         1
  2010,2011,2012,2013,2014,2015        19

This would be the equivalent of finding out:

  • How many patients visited the hospital only in 2010?
  • How many patients visited the hospital in 2010 and 2013?
  • etc.

I tried to find a way to do this different ways:

# Method 1: Did not work
library(data.table)
final = setDT(my_data)[,list(Count=.N) ,names(my_data)]

# Method 2: Did not work

final =  = aggregate(.~year,rev(aggregate(.~id,my_data,paste0,collapse = ",")),length) 

# Method 3: Not sure if this is correct? There are 50 rows (instead of 32) and all frequency counts are 1?

library(dplyr)
library(stringr)
df = my_data %>% 
    group_by(id) %>% 
    summarise(years = str_c(year, collapse=",")) %>% 
    count(years)

My Question: Can someone please show me how to fix this?


Solution

  • In base R:

    table(tapply(my_data$year, my_data$id, 
                 function(x) paste(sort(unique(x)), collapse=',')))