python for-loop bioinformatics dna-sequence

Why does my for loop (python) shift behaviour after 4 iterations?

I'm trying to write a programme that shifts through the elements, of defined length, of a DNA sequence, I can't understand the output I'm getting from the loop. It seems to frameshift fine for the first four iterations of the loop, then seems to revert to old sequences. I've tried really hard to understand the behaviour but I'm too new to programming to solve this, any help much appreciated.

Here is my code:

seq = "ACTGCATTTTGCATTTT"

search = "TGCATTTTG"

import regex as re

def kmers(text,n):
  for a in text:
    b = text[text.index(a):text.index(a)+n]
    c = len(re.findall(b, text, overlapped=True))
    print ("the count for " + b + " is " + str(c))

(kmers(seq,3))

and my output:

the count for ACT is 1
the count for CTG is 1
the count for TGC is 2
the count for GCA is 2
#I expected 'CAT' next, from here on I don't understand the behaviour

the count for CTG is 1 
the count for ACT is 1
the count for TGC is 2
the count for TGC is 2
the count for TGC is 2
the count for TGC is 2
the count for GCA is 2
the count for CTG is 1
the count for ACT is 1
the count for TGC is 2
the count for TGC is 2
the count for TGC is 2
the count for TGC is 2

Obviously eventually I want to remove duplicates, etc, but being stuck on why my for loop isn't working how I expected it to has stopped me in my tracks to make this better.

Thanks

Solution

text.index always returns the first index found. Since you iterate your seq letter by letter, the first time you hit a previously found letter you get weird results.

The 5th letter is the first duplicate, a c, and so text.index('c') is returning the index of the first c, 1, and not 4 as you expect - and you duplicate the previous time you hit c.

This method is inefficient - you seem to be more interested in moving across indices than letters, so I would use:

for a in range(len(text)-(n-1)):
    b = text[a:a+n]
    c = len(re.findall(b, text, overlapped=True))
    print ("the count for " + b + " is " + str(c))

Instead of searching for the index each time, which is both inefficient and in your case produces wrong results. findall is also an inefficient way of counting here - a dictionary, specifically defaultdict might be constructed to count more efficiently.

Note there are already nice builtins you may use:

>>> from collections import Counter
>>> seq='ACTGCATTTTGCATTTT'
>>> Counter((seq[i:i+3] for i in range(len(seq)-2)))
Counter({'TTT': 4, 'TGC': 2, 'GCA': 2, 'CAT': 2, 'ATT': 2, 'ACT': 1, 'CTG': 1, 'TTG': 1})

The final hits are where the string ends, and you can disregard them.