I am trying to work with a many-to-many mapping, finding subsets of one set that map to specific subsets of the other set.
I have many genes. Each gene is a member of one or more COGs (and vice versa), eg.
I have a short set of COGs that represents an enzyme, eg. COG1,COG273.
I want to find all sets of genes that between them have membership of every COG in the enzyme, but without unnecessary overlaps (in this case, for instance, 'gene1 and gene6' would be spurious as gene6 is already a member of both COGs).
In this example, the answers would be:
Although I could get all members of each COG and create a 'product', this would contain spurious results (as mentioned above) where more genes than necessary are in the set.
My mappings are currently contained in a dictionary where the key is the gene ID and the value is a list of the COG IDs of which that gene is a member. However I accept that this might not be the best way to have the mapping stored.
Thanks for the suggestions, they have inspired me to hack something together using recursion. I want to deal with arbitrary gene-cog relationships, so it needs to be a general solution. This should yield all sets of genes (enzymes) that between them are members of all required COGs, without duplicate enzymes and without redundant genes:
def get_enzyme_cogs(enzyme, gene_cog_dict):
"""Get all COGs of which there is at least one member gene in the enzyme."""
cog_list = []
for gene in enzyme:
cog_list.extend(gene_cog_dict[gene])
return set(cog_list)
def get_gene_by_gene_cogs(enzyme, gene_cog_dict):
"""Get COG memberships for each gene in enzyme."""
cogs_list = []
for gene in enzyme:
cogs_list.append(set(gene_cog_dict[gene]))
return cogs_list
def add_gene(target_enzyme_cogs, gene_cog_dict, cog_gene_dict, proposed_enzyme = None, fulfilled_cogs = None):
"""Generator for all enzymes with membership of all target_enzyme_cogs, without duplicate enzymes or redundant genes."""
base_enzyme_genes = proposed_enzyme or []
fulfilled_cogs = get_enzyme_cogs(base_enzyme_genes, target_enzyme_cogs, gene_cog_dict)
## Which COG will we try to find a member of?
next_cog_to_fill = sorted(list(target_enzyme_cogs-fulfilled_cogs))[0]
gene_members_of_cog = cog_gene_dict[next_cog_to_fill]
for gene in gene_members_of_cog:
## Check whether any already-present gene's COG set is a subset of the proposed gene's COG set, if so skip addition
subset_found = False
proposed_gene_cogs = set(gene_cog_dict[gene]) & target_enzyme_cogs
for gene_cogs_set in get_gene_by_gene_cogs(base_enzyme_genes, target_enzyme_cogs, gene_cog_dict):
if gene_cogs_set.issubset(proposed_gene_cogs):
subset_found = True
break
if subset_found:
continue
## Add gene to proposed enzyme
proposed_enzyme = deepcopy(base_enzyme_genes)
proposed_enzyme.append(gene)
## Determine which COG memberships are fulfilled by the genes in the proposed enzyme
fulfilled_cogs = get_enzyme_cogs(proposed_enzyme, target_enzyme_cogs, gene_cog_dict)
if (fulfilled_cogs & target_enzyme_cogs) == target_enzyme_cogs:
## Proposed enzyme has members of every required COG, so yield
enzyme = deepcopy(proposed_enzyme)
proposed_enzyme.remove(gene)
yield enzyme
else:
## Proposed enzyme is still missing some COG members
for enzyme in add_gene(target_enzyme_cogs, gene_cog_dict, cog_gene_dict, proposed_enzyme, fulfilled_cogs):
yield enzyme
Input:
gene_cog_dict = {'gene1':['COG1','COG1003'], 'gene2':['COG2'], 'gene3':['COG273'], 'gene4':['COG1'], 'gene5':['COG273','COG71'], 'gene6':['COG1','COG273']}
cog_gene_dict = {'COG2': ['gene2'], 'COG1': ['gene1', 'gene4', 'gene6'], 'COG71': ['gene5'], 'COG273': ['gene3', 'gene5', 'gene6'], 'COG1003': ['gene1']}
target_enzyme_cogs = ['COG1','COG273']
Usage:
for enzyme in add_gene(target_enzyme_cogs, gene_cog_dict, cog_gene_dict):
print enzyme
Output:
['gene1', 'gene3']
['gene1', 'gene5']
['gene4', 'gene3']
['gene4', 'gene5']
['gene6']
I have no idea about its performance though.