I successfully used pandas to create a Pearson correlation matrix. From this matrix, I was able to extract many pairs of genes that correlated more than a certain threshold (0.75 in this instance) and then saved each pair as a tuple within a large list.
How do I now go about through this list to generate every possible correlated gene combination that's more than just pairs? For example:
Lets say there are 4 genes: A, B, C, and D. Each gene is correlated with the other (and itself obviously). This means somewhere in the big list there are the following 10 separate tuple pairs [(A,A), (B,B), (C,C), (D,D), (A,B), (A,C), (A,D), (B,C), (B,D), (C,D)]. However, from this you could also now create the tuples (A, B, C) and (A, B, C, D) and (B, C, D) and (A, C, D) and (A, B, D) and so on since they all correlate with each other. How do I go about writing a function that can create every combination of these new groups using just the pairs I currently have saved in my list?
By the way, if gene A correlated with gene B, and gene B correlated with gene C, but gene A did not correlate with gene C, they would NOT form the group (A, B, C). Everything has to correlate with each other to form a group.
The number of tuple pairs is in the millions, so finding an efficient way to get this done would be ideal (but not essential).
I am honestly not sure where I can begin. I was recommended to use a function that would give me all subsets of an array but that doesn't really help with the overall issue. Here are some possible steps I thought of that would technically get me what I want but would be extremely inefficient.
Sample Input: [(A,A), (B,B), (C,C), (D,D), (E,E), (A,B), (A,C), (A,D), (B,C), (B,D), (C,D), (C,E)]
Sample Output: [(A,A), (B,B), (C,C), (D,D), (E,E), (A,B), (A,C), (A,D), (B,C), (B,D), (C,D), (C,E), (A,B,C,D), (A,B,C), (B,C,D), (A,C,D), (A,B,D)]
Note how E isn't found in any of the new combinations since it only correlates with itself and C so it can't be included in any other group.
This falls into the class of problems known as clique problems.
Specifically you want to list all maximal cliques of the graph whose vertices are your genes and whose edges are your corellated pairs.
According to wikipedia the most efficient algorithm in pratice is the Bron–Kerbosch algorithm (or its variations) from 1973.
Here is an implementation using find_cliques from the networkx package. It uses the Bron and Kerbosch algorighm.
editors note: It doesn't return the expected output, but this is a fascinating bit of information. I don't know enough about graph theory to validate if this should work or not or what needs to change to make it work. If the original poster would fix this or provide guidance that would be superb.
import networkx as nx
from networkx.algorithms.clique import find_cliques as maximal_cliques
genes = ["A", "B", "C", "D", "E"]
pairs = [
("A", "A"),
("B", "B"),
("C", "C"),
("D", "D"),
("E", "E"),
("A", "B"),
("A", "C"),
("A", "D"),
("B", "C"),
("B", "D"),
("C", "D"),
("C", "E"),
]
G = nx.Graph()
G.add_nodes_from(genes)
G.add_edges_from(pairs)
print(list(maximal_cliques(G)))
This yields:
[['C', 'D', 'B', 'A'], ['C', 'E']]
This package is mentioned in this related answer.
I was slightly of in my original awnser: The desired result as specified by OP is the set of all complete subgraphs (i.e. cliques) not just the maximal ones. To go from the maximal cliques to every clique one simply has to add all possible subsets of each maximal clique. As every subgraph of a maximal complete subgraph of G is also itself a complete subgraph of G.
Or one can just use enumerate_all_cliques
from the networkx
package.
import networkx as nx
from networkx.algorithms.clique import enumerate_all_cliques as all_cliques
pairs = [
# the self referencing edges (e.g. [A,A]) are not needed
("A", "B"),
("A", "C"),
("A", "D"),
("B", "C"),
("B", "D"),
("C", "D"),
("C", "E"),
]
G = nx.Graph()
# one does not have to explicilty add the vertices as long as they are part of an edge
G.add_edges_from(pairs)
print(list(all_cliques(G)))
This yields:
[['A'], ['B'], ['C'], ['D'], ['E'], ['A', 'B'], ['A', 'C'], ['A', 'D'], ['B', 'C'], ['B', 'D'], ['C', 'D'], ['C', 'E'], ['A', 'B', 'C'], ['A', 'B', 'D'], ['A', 'C', 'D'], ['B', 'C', 'D'], ['A', 'B', 'C', 'D']]
Strictly speeking this is still not quite what was requested, as the self-correlation entries are represented as [A] instead of [A,A] but that can easly be fixed with some postprocessing.