I am trying to optimize my code because it becomes really slow when I load huge dictionaries. I think this is because it searches for a key in the dictionary on every insert. I've been reading about Python's defaultdict and I think it might be a good improvement, but I have failed to implement it here. As you can see, it is a hierarchical dictionary structure. Any hint will be appreciated.
class Species:
    '''This structure contains all the information needed for all genes.
    One specie have several genes, one gene several proteins'''
    def __init__(self, name):
        self.name = name #name of the GENE
        self.genes = {}
    def addProtein(self, gene, protname, len):
        #Converting a line from the input file into a protein and/or an exon
        if gene in self.genes:
            #Gene in the structure
            self.genes[gene].proteins[protname] = Protein(protname, len)
            self.genes[gene].updateProts()
        else:
            self.genes[gene] = Gene(gene)
            self.updateNgenes()
            self.genes[gene].proteins[protname] = Protein(protname, len)
            self.genes[gene].updateProts()
    def updateNgenes(self):
        #Updating the number of genes
        self.ngenes = len(self.genes.keys())
The definitions of Gene and Protein are:
class Protein:
    #The class protein contains information about the length of the protein and a list with it's exons (with it's own attributes)
    def __init__(self, name, len):
        self.name = name
        self.len = len
class Gene:
    #The class gene contains information about the gene and a dict with it's proteins (with it's own attributes)
    def __init__(self, name):
        self.name = name
        self.proteins = {}
        self.updateProts()
    def updateProts(self):
        #Update number of proteins
        self.nproteins = len(self.proteins)
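You cannot use a defaultdict directly because your __init__ methods require arguments: a defaultdict calls its default factory with no arguments, so it cannot pass the key on to Gene(gene).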
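If you still want defaultdict-like behaviour, one possible workaround is a dict subclass that defines __missing__, which Python calls with the missing key on a failed lookup. The sketch below is only an illustration: GeneDict is a hypothetical helper name, the Gene and Protein classes are cut down to the minimum, the length argument is renamed to avoid shadowing the built-in len, and the gene and protein names are made up.

```python
class Protein:
    def __init__(self, name, length):
        self.name = name
        self.length = length


class Gene:
    def __init__(self, name):
        self.name = name
        self.proteins = {}


class GeneDict(dict):
    """Hypothetical helper: a dict that builds a Gene from a missing key."""

    def __missing__(self, key):
        # Called by dict.__getitem__ only when `key` is absent.
        gene = self[key] = Gene(key)
        return gene


genes = GeneDict()
# The first lookup creates the Gene; the second reuses it.
genes["geneA"].proteins["protA1"] = Protein("protA1", 120)
genes["geneA"].proteins["protA2"] = Protein("protA2", 95)

print(len(genes), len(genes["geneA"].proteins))  # 1 2
```

Note that membership tests such as `"geneB" in genes` do not trigger __missing__, so they never create entries by accident.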
This is probably one of your bottlenecks:
def updateNgenes(self):
    #Updating the number of genes
    self.ngenes = len(self.genes.keys())
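In Python 2, len(self.genes.keys()) creates a list of all keys before calculating the length. This means that every time you add a gene, you create a list and throw it away, and that list creation gets more and more expensive the more genes you have. To avoid creating an intermediate list, just do len(self.genes). (In Python 3, keys() returns a view rather than a list, so the cost is much smaller, but len(self.genes) is still the more direct form.)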
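To get a feel for the difference, here is a rough timing sketch, assuming Python 3. It wraps keys() in list() to imitate the list that Python 2 built; the dictionary size and repeat count are arbitrary choices, and the absolute timings will vary from machine to machine.

```python
import timeit

# A dictionary with made-up keys, standing in for a large self.genes.
genes = {"gene%d" % i: None for i in range(100000)}

# Asking the dict for its size directly.
direct = timeit.timeit(lambda: len(genes), number=100)

# Copying every key into a list first, as Python 2's keys() did.
via_list = timeit.timeit(lambda: len(list(genes.keys())), number=100)

print("len(genes):              %.6f s" % direct)
print("len(list(genes.keys())): %.6f s" % via_list)
```

Both expressions return the same number; only the amount of work done to get it differs.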
Better yet would be to make ngenes a property so it is only calculated when you need it.
@property
def ngenes(self):
    return len(self.genes)
The same can be done with nproteins in the Gene class.
Here is your code refactored:
class Species:
    '''This structure contains all the information needed for all genes.
    One species has several genes, one gene several proteins'''
    def __init__(self, name):
        self.name = name  # name of the species
        self.genes = {}
    def addProtein(self, gene, protname, len):
        # Converting a line from the input file into a protein and/or an exon
        if gene not in self.genes:
            self.genes[gene] = Gene(gene)
        self.genes[gene].proteins[protname] = Protein(protname, len)
    @property
    def ngenes(self):
        return len(self.genes)
class Protein:
    # The class Protein contains information about the length of the protein and a list with its exons (with their own attributes)
    def __init__(self, name, len):
        self.name = name
        self.len = len
class Gene:
    # The class Gene contains information about the gene and a dict with its proteins (with their own attributes)
    def __init__(self, name):
        self.name = name
        self.proteins = {}
    @property
    def nproteins(self):
        return len(self.proteins)
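As a quick check of how the refactored classes might be used, here is a self-contained sketch. The tab-separated line format and the gene and protein names are assumptions made up for illustration (the question does not show the input file), and the length parameter is renamed to avoid shadowing the built-in len.

```python
class Protein:
    def __init__(self, name, length):
        self.name = name
        self.length = length


class Gene:
    def __init__(self, name):
        self.name = name
        self.proteins = {}

    @property
    def nproteins(self):
        # Computed on demand, so it can never go stale.
        return len(self.proteins)


class Species:
    def __init__(self, name):
        self.name = name
        self.genes = {}

    def addProtein(self, gene, protname, length):
        # Create the gene on first sight, then attach the protein.
        if gene not in self.genes:
            self.genes[gene] = Gene(gene)
        self.genes[gene].proteins[protname] = Protein(protname, length)

    @property
    def ngenes(self):
        return len(self.genes)


# Hypothetical input lines: gene, protein name, protein length.
lines = ["geneA\tprotA1\t120", "geneA\tprotA2\t95", "geneB\tprotB1\t300"]

species = Species("example")
for line in lines:
    gene, protname, length = line.split("\t")
    species.addProtein(gene, protname, int(length))

print(species.ngenes)                    # 2
print(species.genes["geneA"].nproteins)  # 2
```

Because the counts are properties, there is no update call to forget after each insert.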