I've been performing phylogenetic analysis in R for a while, employing libraries like ape, phangorn and phytools.
While solving a problem, I've come to a presence/absence data.frame that specifies if genes of interest belong (or don't) to a certain group.
An example of this would be:
gene11 gene25 gene33 gene54 gene55 gene65 gene73 gene88
group_1 1 1 0 0 0 0 0 0
group_2 1 1 1 0 0 0 0 0
group_3 1 0 1 0 0 0 0 0
group_4 0 1 1 0 0 0 0 0
group_5 0 0 0 1 1 0 0 0
group_6 0 0 0 1 0 0 0 0
group_7 0 0 0 0 1 0 0 0
group_8 0 0 0 0 0 1 1 1
group_9 0 0 0 0 0 1 1 0
group_10 0 0 0 0 0 1 0 1
group_11 0 0 0 0 0 0 1 1
As expected when dealing with groups of biological entities, there many ways in which this entities relate: genes 11, 25 and 33 form a group, and also their relationships could be described smaller groups, depicting pairwise relationships.
So here is the important thing: group_2, group_5 and group_8 are the biologically relevant groups of genes, and they aren't known beforehand as the relevant groups. The other, smaller groups, arise as a consequence of the relationship shown in these relevant groups: group_1 relates gene11 and gene25, but is a group that is nested in the broader (and relevant) group_2. The same applies in the other cases: group_8 depicts a relationship between gene65, gene73 and gene88; the other groups concerning these genes (group_9, group_10 and group_11) are only subgroups depicting the pairwise relationships existing among the genes that are part of the broader group group_8.
What is known beforehand is that genes form clusters of disjoint groups, each cluster being composed of other (progressively smaller) clusters. I'm interested in capturing the biggest-disjoint groups.
A clear definition of the problem was done by another user (@Shree):
Find minimum number of groups such that all other groups are a sub-group of at least one of those groups. Also a group has to have at least 2 genes i.e. two 1s in a row. Also assuming, 1,01,0 is a subgroup of
1,1,1,0
but0,1,1,1
is not a subgroup of1,1,1,0
.
Thanks to all in advance!
Here's a way using mixed integer programming approach. I am using ompr
for mathematical modeling and glpk
(free open source) as solver. Modeling logic is provided as comments in code.
I think the problem can be mathematically described as follows -
Filter dataframe to minimize number of rows such that sum of all columns is 1. Selected rows are called primary groups and every other row should be a subgroup of a primary group. A column (gene) can belong to only one primary group. Any unselected row is a subgroup of a primary group when
subgroup <= primary group
at all positions (columns). Therefore,(0,0,1,1)
is subgroup of(0,1,1,1)
but(1,0,1,1)
is not a subgroup of(0,1,1,1)
.
library(dplyr)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)
gene_mat <- as.matrix(df)
nr <- nrow(gene_mat)
nc <- ncol(gene_mat)
model <- MIPModel() %>%
# binary variable x[i] is 1 if row i is selected else 0
add_variable(x[i], i = 1:nr, type = "binary") %>%
# minimize total rows selected
set_objective(sum_expr(x[i], i = 1:nr), "min") %>%
# sum of columns of selected rows must be = 1
add_constraint(sum_expr(gene_mat[i,j]*x[i], i = 1:nr) == 1, j = 1:nc) %>%
solve_model(with_ROI(solver = "glpk"))
# get rows selected
group_rows <- model %>%
get_solution(x[i]) %>%
filter(value > 0) %>%
pull(i) %>%
print()
result <- df[group_rows, ]
gene11 gene25 gene33 gene54 gene55 gene65 gene73 gene88
group_2 1 1 1 0 0 0 0 0
group_5 0 0 0 1 1 0 0 0
group_8 0 0 0 0 0 1 1 1
Important Note -
The above formulation does not address subgroup <= primary group
but instead relies on the fact that OP mentions "What is known beforehand is that genes form clusters of disjoint groups". This means cases like shown below do not exist in data since rows 1,3,4 do not form disjoint groups i.e column 3 would belong to 2 primary groups which is not allowed.
1 1 0 0 0
0 1 0 0 0
1 0 1 0 0 <- this row is not a subgroup of any row
0 0 1 1 1
Anyways, here's code to do a safety check to make sure all unselected rows are subgroup of only one primary group -
test <- lapply(group_rows, function(x) {
sweep(df, 2, as.numeric(df[x, ]), "<=") %>%
{which(rowSums(.) == ncol(df))}
})
# all is okay if below returns TRUE
length(Reduce(intersect, test)) == 0
Data -
df <- structure(list(
gene11 = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,0L, 0L),
gene25 = c(1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
gene33 = c(0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
gene54 = c(0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L),
gene55 = c(0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L),
gene65 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L),
gene73 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L),
gene88 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L)),
class = "data.frame",
row.names = c("group_1", "group_2", "group_3", "group_4",
"group_5", "group_6", "group_7", "group_8",
"group_9", "group_10", "group_11")
)