Search code examples
pythonsetmany-to-many

Python: Find subsets across many-to-many mapping


I am trying to work with a many-to-many mapping, finding subsets of one set that map to specific subsets of the other set.

I have many genes. Each gene is a member of one or more COGs (and vice versa), eg.

  • gene1 is member of COG1
  • gene1 is member of COG1003
  • gene2 is member of COG2
  • gene3 is member of COG273
  • gene4 is member of COG1
  • gene5 is member of COG273
  • gene5 is member of COG71
  • gene6 is member of COG1
  • gene6 is member of COG273

I have a short set of COGs that represents an enzyme, eg. COG1,COG273.

I want to find all sets of genes that between them have membership of every COG in the enzyme, but without unnecessary overlaps (in this case, for instance, 'gene1 and gene6' would be spurious as gene6 is already a member of both COGs).

In this example, the answers would be:

  • gene1 and gene3
  • gene1 and gene5
  • gene3 and gene4
  • gene4 and gene5
  • gene6

Although I could get all members of each COG and create a 'product', this would contain spurious results (as mentioned above) where more genes than necessary are in the set.

My mappings are currently contained in a dictionary where the key is the gene ID and the value is a list of the COG IDs of which that gene is a member. However I accept that this might not be the best way to have the mapping stored.


Solution

  • Thanks for the suggestions, they have inspired me to hack something together using recursion. I want to deal with arbitrary gene-cog relationships, so it needs to be a general solution. This should yield all sets of genes (enzymes) that between them are members of all required COGs, without duplicate enzymes and without redundant genes:

    def get_enzyme_cogs(enzyme, gene_cog_dict):
        """Get all COGs of which there is at least one member gene in the enzyme."""
        cog_list = []
        for gene in enzyme:
            cog_list.extend(gene_cog_dict[gene])
        return set(cog_list)
    
    def get_gene_by_gene_cogs(enzyme, gene_cog_dict):
        """Get COG memberships for each gene in enzyme."""
        cogs_list = []
        for gene in enzyme:
            cogs_list.append(set(gene_cog_dict[gene]))
        return cogs_list
    
    def add_gene(target_enzyme_cogs, gene_cog_dict, cog_gene_dict, proposed_enzyme = None, fulfilled_cogs = None):
        """Generator for all enzymes with membership of all target_enzyme_cogs, without duplicate enzymes or redundant genes."""
    
        base_enzyme_genes = proposed_enzyme or []
        fulfilled_cogs = get_enzyme_cogs(base_enzyme_genes, target_enzyme_cogs, gene_cog_dict)
    
        ## Which COG will we try to find a member of?
        next_cog_to_fill = sorted(list(target_enzyme_cogs-fulfilled_cogs))[0]
        gene_members_of_cog = cog_gene_dict[next_cog_to_fill] 
    
        for gene in gene_members_of_cog:
    
            ## Check whether any already-present gene's COG set is a subset of the proposed gene's COG set, if so skip addition
            subset_found = False
            proposed_gene_cogs = set(gene_cog_dict[gene]) & target_enzyme_cogs
            for gene_cogs_set in get_gene_by_gene_cogs(base_enzyme_genes, target_enzyme_cogs, gene_cog_dict):
                if gene_cogs_set.issubset(proposed_gene_cogs):
                    subset_found = True
                    break
            if subset_found:
                continue
    
            ## Add gene to proposed enzyme
            proposed_enzyme = deepcopy(base_enzyme_genes)
            proposed_enzyme.append(gene)
    
            ## Determine which COG memberships are fulfilled by the genes in the proposed enzyme
            fulfilled_cogs = get_enzyme_cogs(proposed_enzyme, target_enzyme_cogs, gene_cog_dict)
    
            if (fulfilled_cogs & target_enzyme_cogs) == target_enzyme_cogs:
                ## Proposed enzyme has members of every required COG, so yield 
                enzyme = deepcopy(proposed_enzyme)
                proposed_enzyme.remove(gene)
                yield enzyme
            else:
                ## Proposed enzyme is still missing some COG members
                for enzyme in add_gene(target_enzyme_cogs, gene_cog_dict, cog_gene_dict, proposed_enzyme, fulfilled_cogs):
                    yield enzyme
    

    Input:

    gene_cog_dict = {'gene1':['COG1','COG1003'], 'gene2':['COG2'], 'gene3':['COG273'], 'gene4':['COG1'], 'gene5':['COG273','COG71'], 'gene6':['COG1','COG273']}
    cog_gene_dict = {'COG2': ['gene2'], 'COG1': ['gene1', 'gene4', 'gene6'], 'COG71': ['gene5'], 'COG273': ['gene3', 'gene5', 'gene6'], 'COG1003': ['gene1']}
    
    target_enzyme_cogs = ['COG1','COG273']
    

    Usage:

    for enzyme in add_gene(target_enzyme_cogs, gene_cog_dict, cog_gene_dict):
        print enzyme
    

    Output:

    ['gene1', 'gene3']
    ['gene1', 'gene5']
    ['gene4', 'gene3']
    ['gene4', 'gene5']
    ['gene6']
    

    I have no idea about its performance though.