This code runs quickly on the sample data, but when iterating over a large file, it seems to run slowly, perhaps because of the nested for loops. Is there any other reason why iterating over items in a defaultdict is slow?
import itertools
sample_genes1={"0002":["GENE1", "GENE2", "GENE3", "GENE4"],
"0003":["GENE1", "GENE2", "GENE3", "GENE6"],
"0202":["GENE4", "GENE2", "GENE1", "GENE7"]}
def get_common_gene_pairs(genelist):
genedict={}
for k,v in sample_genes1.items():
listofpairs=[]
for i in itertools.combinations(v,2):
listofpairs.append(i)
genedict[k]=listofpairs
return genedict
from collections import namedtuple,defaultdict
def get_gene_pair_pids(genelist):
i=defaultdict(list)
d=get_common_gene_pairs(sample_genes1)
Pub_genes=namedtuple("pair", ["gene1", "gene2"])
for p_id,genepairs in d.iteritems():
for p in genepairs:
thispair=Pub_genes(p[0], p[1])
if thispair in i.keys():
i[thispair].append(p_id)
else:
i[thispair]=[p_id,]
return i
if __name__=="__main__":
get_gene_pair_pids(sample_genes1)
One big problem: this line:
if thispair in i.keys():
doesn't take advantage of dictionary search, it's linear search. Drop the keys()
call, let the dictionary do its fast lookup:
if thispair in i:
BUT since i
is a default dict which creates a list
when key doesn't exist, just replace:
if thispair in i.keys():
i[thispair].append(p_id) # i is defaultdict: even if thispair isn't in the dict, it will create a list and append p_id.
else:
i[thispair]=[p_id,]
by this simple statement:
i[thispair].append(p_id)
(it's even faster because there's only one hashing of p_id
)
to sum it up:
thispair in i.keys()
: it's slow, in python 2, or 3, defaultdict
or notdefaultdict
, but your code just assumed a dict
, which works but slower.Note: without defaultdict
you could have just removed the .keys()
or done this with a simple dict
:
i.setdefault(thispair,list)
i[thispair].append(p_id)
(here the default item depends on the key)
Aside:
def get_common_gene_pairs(genelist):
genedict={}
for k,v in sample_genes1.items(): # should be genelist, not the global variable, you're not using the genelist parameter at all
And you're not using the values of sample_genes1 at all in your code.