Tags: r, combinations, overlap

Find the most frequent subset in a data frame


I have the following data frame, and I want to find the combination of genes that are present together most often:

  sample genea geneb genec gened genee genef
1      1     1     1     1     0     0     0
2      2     1     1     1     0     0     0
3      3     1     0     0     1     1     1
4      4     0     0     0     0     0     0
5      5     1     0     1     1     1     1
6      6     0     0     0     0     0     0

In this case, my desired output would be: genea + genec = 3 overlapping samples.

test[sort.list(colSums(test[,-1]), decreasing=TRUE)[1:15] + 1] gives me the genes with the most 1 values, but that only counts each gene on its own, and I am stuck on how to get from there to combinations.

How do I approach this?


Solution

  • One way would be to use crossprod():

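    library(tidyr)
    library(dplyr)
    
    # dat is the data frame from the question
    dat %>%
      # long format: one row per (sample, gene) pair
      pivot_longer(-sample) %>%
      # keep only the genes that are present in a sample
      filter(value == 1) %>%
      select(-value) %>%
      # sample-by-gene incidence table
      table() %>%
      # gene-by-gene matrix of co-occurrence counts
      crossprod() %>%
      # blank out the diagonal and the duplicated lower triangle
      replace(lower.tri(., diag = TRUE), NA) %>%
      as.data.frame.table() %>%
      # keep the pair(s) with the highest count
      slice_max(Freq)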
    
       name name.1 Freq
    1 genea  genec    3
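
For comparison, here is a minimal base R sketch of the same idea, without tidyr or dplyr. It rebuilds the example data from the question so that it runs on its own; the names m, co, best and pairs are only illustrative and do not come from the answer above. It assumes every column except sample holds 0/1 values.

```r
# Recreate the example data from the question.
dat <- data.frame(
  sample = 1:6,
  genea  = c(1, 1, 1, 0, 1, 0),
  geneb  = c(1, 1, 0, 0, 0, 0),
  genec  = c(1, 1, 0, 0, 1, 0),
  gened  = c(0, 0, 1, 0, 1, 0),
  genee  = c(0, 0, 1, 0, 1, 0),
  genef  = c(0, 0, 1, 0, 1, 0)
)

# 0/1 matrix with one row per sample and one column per gene.
m <- as.matrix(dat[, -1])

# t(m) %*% m: entry [i, j] counts the samples where gene i and gene j are both 1.
co <- crossprod(m)

# The diagonal holds per-gene totals and the lower triangle repeats the upper
# one, so mask both before looking for the maximum.
co[lower.tri(co, diag = TRUE)] <- NA

# Row/column positions of every pair that reaches the maximum count
# (ties are all returned).
best <- which(co == max(co, na.rm = TRUE), arr.ind = TRUE)

pairs <- data.frame(
  gene1     = rownames(co)[best[, "row"]],
  gene2     = colnames(co)[best[, "col"]],
  n_samples = co[best]
)
print(pairs)
```

On the example data this should report genea and genec with 3 samples, matching the output above. Like the crossprod() answer, it only covers pairs of genes, not larger subsets.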