Search code examples
pythoncollectionsdefaultdictnamedtuple

Slowness on iterating over namedtuple and dicts for defined pairs


This code runs quickly on the sample data, but when iterating over a large file, it seems to run slowly, perhaps because of the nested for loops. Is there any other reason why iterating over items in a defaultdict is slow?

import itertools

sample_genes1={"0002":["GENE1", "GENE2", "GENE3", "GENE4"],
               "0003":["GENE1", "GENE2", "GENE3", "GENE6"],
               "0202":["GENE4", "GENE2", "GENE1", "GENE7"]}

def get_common_gene_pairs(genelist):
    genedict={}
    for k,v in sample_genes1.items():
        listofpairs=[]
        for i in itertools.combinations(v,2):
            listofpairs.append(i)
            genedict[k]=listofpairs
    return genedict

from collections import namedtuple,defaultdict
def get_gene_pair_pids(genelist):
    i=defaultdict(list)
    d=get_common_gene_pairs(sample_genes1)
    Pub_genes=namedtuple("pair", ["gene1", "gene2"])
    for p_id,genepairs in d.iteritems():
        for p in genepairs:
            thispair=Pub_genes(p[0], p[1])
            if thispair in i.keys():
                i[thispair].append(p_id)
            else:
                i[thispair]=[p_id,]
    return i

if __name__=="__main__":
    get_gene_pair_pids(sample_genes1)

Solution

  • One big problem: this line:

    if thispair in i.keys():
    

    doesn't take advantage of dictionary search, it's linear search. Drop the keys() call, let the dictionary do its fast lookup:

    if thispair in i:
    

    BUT since i is a default dict which creates a list when key doesn't exist, just replace:

    if thispair in i.keys():
        i[thispair].append(p_id)  # i is defaultdict: even if thispair isn't in the dict, it will create a list and append p_id.
    else:
        i[thispair]=[p_id,]
    

    by this simple statement:

    i[thispair].append(p_id)
    

    (it's even faster because there's only one hashing of p_id)

    to sum it up:

    • don't do thispair in i.keys(): it's slow, in python 2, or 3, defaultdict or not
    • you have defined a defaultdict, but your code just assumed a dict, which works but slower.

    Note: without defaultdict you could have just removed the .keys() or done this with a simple dict:

    i.setdefault(thispair,list)
    i[thispair].append(p_id)
    

    (here the default item depends on the key)

    Aside:

    def get_common_gene_pairs(genelist):
        genedict={}
        for k,v in sample_genes1.items():  # should be genelist, not the global variable, you're not using the genelist parameter at all
    

    And you're not using the values of sample_genes1 at all in your code.