I am trying to keep a list of top k elements of a large set of tuples. Since keeping it in the memory is impossible, I want to use a fixed size list to keep only the top k values (with the keys). I have tried to use min heap but python's heap is terrible as it lets non unique keys to be inserted. That is a huge problem. So I thought I can use a sorted list/dict instead (tuples with unique keys). Using the sketch function I retrieve the number of counts that substring has appeared in the whole text (O(1) time)). I am beginning to think I do something wrong with the loops or pops and assignments, because the minheap also has a similar problem where only the top k shows up in the 25 size list, and the rest are rather low counts (when it is in fact higher )
for line in lines[1::4]:
startIdx = 0
while startIdx + k <= (len(line)-k):
kmer = line[startIdx:(startIdx+k)]
count = randint(1, 250)
if count > 2:
if len(tdict.keys()) < topcount:
tdict[km] = count
else:
kMin = (sorted(tdict,reverse = False, key=lambda x: x[1]))
if count > tdict[kMin[0]]:
topkmerdict.pop(kMin[0])
topkmerdict[km] = count
startIdx += 1
linesProcessed += 1
Please try changing the line:
kmerMin = (sorted(topkmerdict,reverse = False, key=lambda x: x[1]))
to:
kmerMin = (sorted(topkmerdict,reverse = False)
The previous line is only sorting on the second character of the string key values.