I'm trying to write a programme that shifts through the elements, of defined length, of a DNA sequence, I can't understand the output I'm getting from the loop. It seems to frameshift fine for the first four iterations of the loop, then seems to revert to old sequences. I've tried really hard to understand the behaviour but I'm too new to programming to solve this, any help much appreciated.
Here is my code:
seq = "ACTGCATTTTGCATTTT"
search = "TGCATTTTG"
import regex as re
def kmers(text,n):
for a in text:
b = text[text.index(a):text.index(a)+n]
c = len(re.findall(b, text, overlapped=True))
print ("the count for " + b + " is " + str(c))
(kmers(seq,3))
and my output:
the count for ACT is 1
the count for CTG is 1
the count for TGC is 2
the count for GCA is 2
#I expected 'CAT' next, from here on I don't understand the behaviour
the count for CTG is 1
the count for ACT is 1
the count for TGC is 2
the count for TGC is 2
the count for TGC is 2
the count for TGC is 2
the count for GCA is 2
the count for CTG is 1
the count for ACT is 1
the count for TGC is 2
the count for TGC is 2
the count for TGC is 2
the count for TGC is 2
Obviously eventually I want to remove duplicates, etc, but being stuck on why my for loop isn't working how I expected it to has stopped me in my tracks to make this better.
Thanks
text.index
always returns the first index found. Since you iterate your seq
letter by letter, the first time you hit a previously found letter you get weird results.
The 5th letter is the first duplicate, a c
, and so text.index('c')
is returning the index of the first c
, 1, and not 4 as you expect - and you duplicate the previous time you hit c
.
This method is inefficient - you seem to be more interested in moving across indices than letters, so I would use:
for a in range(len(text)-(n-1)):
b = text[a:a+n]
c = len(re.findall(b, text, overlapped=True))
print ("the count for " + b + " is " + str(c))
Instead of searching for the index each time, which is both inefficient and in your case produces wrong results. findall
is also an inefficient way of counting here - a dictionary, specifically defaultdict
might be constructed to count more efficiently.
Note there are already nice builtins you may use:
>>> from collections import Counter
>>> seq='ACTGCATTTTGCATTTT'
>>> Counter((seq[i:i+3] for i in range(len(seq)-2)))
Counter({'TTT': 4, 'TGC': 2, 'GCA': 2, 'CAT': 2, 'ATT': 2, 'ACT': 1, 'CTG': 1, 'TTG': 1})
The final hits are where the string ends, and you can disregard them.