I have the following data frame, and I want to find the combination of genes that occur together in the most samples:
  sample genea geneb genec gened genee genef
1      1     1     1     1     0     0     0
2      2     1     1     1     0     0     0
3      3     1     0     0     1     1     1
4      4     0     0     0     0     0     0
5      5     1     0     1     1     1     1
6      6     0     0     0     0     0     0
In this case, my desired output would be: gene a + gene c = 3 overlapping samples.
test[sort.list(colSums(test[,-1]), decreasing=TRUE)[1:15] + 1]
gives me the columns for the genes with the most 1 values, but I am stuck beyond that.
How should I approach this?
One way would be to use crossprod():
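library(tidyr)
library(dplyr)
dat %>%
  pivot_longer(-sample) %>%                       # long format: sample, name, value
  filter(value == 1) %>%                          # keep only genes that are present
  select(-value) %>%
  table() %>%                                     # sample x gene contingency table
  crossprod() %>%                                 # gene x gene co-occurrence counts
  replace(lower.tri(., diag = TRUE), NA) %>%      # drop self-pairs and mirrored pairs
  as.data.frame.table() %>%                       # back to long format: name, name.1, Freq
  slice_max(Freq)                                 # pair(s) with the highest count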
   name name.1 Freq
1 genea  genec    3
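
Why this works: for a 0/1 matrix with samples in rows and genes in columns, crossprod(m) is t(m) %*% m, so entry [i, j] counts the samples in which gene i and gene j are both 1. As a rough base R sketch of the same idea (the data frame is rebuilt from the table in the question, and the names m, co, idx and best are only illustrative):

```r
# Example data copied from the question; `dat` is the name used in the answer above.
dat <- data.frame(
  sample = 1:6,
  genea  = c(1, 1, 1, 0, 1, 0),
  geneb  = c(1, 1, 0, 0, 0, 0),
  genec  = c(1, 1, 0, 0, 1, 0),
  gened  = c(0, 0, 1, 0, 1, 0),
  genee  = c(0, 0, 1, 0, 1, 0),
  genef  = c(0, 0, 1, 0, 1, 0)
)

# Samples x genes 0/1 matrix (drop the sample id column).
m <- as.matrix(dat[, -1])

# Gene x gene matrix: co[i, j] = number of samples where both genes are 1.
co <- crossprod(m)

# Keep each pair once: blank out the diagonal and the lower triangle.
co[lower.tri(co, diag = TRUE)] <- NA

# Locate the pair(s) with the highest co-occurrence count (ties are all kept).
idx <- which(co == max(co, na.rm = TRUE), arr.ind = TRUE)
best <- data.frame(
  gene1 = rownames(co)[idx[, "row"]],
  gene2 = colnames(co)[idx[, "col"]],
  n     = co[idx]
)
print(best)
```

Note that this only covers pairs of genes; combinations of three or more genes would need a different approach.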