I use Python 2.7 and the regex module. I use this expression to find a short sequence in a longer DNA sequence:
output = regex.findall(r'(?:'+probe+'){s<'+str(int(mismatches)+1)+'}', sequence, regex.BESTMATCH)
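The parameters are probe (the short sequence to look for), sequence (the longer DNA sequence) and mismatches (the maximum number of substitutions allowed).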
Is there a way to get the positions of all the sequences in the genome that match the regex? Does this script find overlapping matches? It works pretty well, but then I decided to try, say:
probe = "TTGACAT"
sequence = "TTGACATTGACATATAAT"
mismatches = 0
I got:
['TTGACAT']
With the same parameters but mismatches = 10, I got:
['TTGACAT','GACATAT']
So I do not know whether the script finds 'TTGACAT' only once because the second occurrence overlaps the first, or whether it actually finds 'TTGACAT' twice and shows the result only once.
Thanks
It finds 'TTGACAT' only once because the second occurrence overlaps the first: by default, findall returns only non-overlapping matches.
If you want all overlapping results, use the same pattern with the overlapped=True keyword argument:
output = regex.findall(r'(?:'+probe+'){s<'+str(int(mismatches)+1)+'}', sequence, regex.BESTMATCH, overlapped=True)
If you want to know the sequence position:
for m in regex.finditer(r'(?:'+probe+'){s<'+str(int(mismatches)+1)+'}', sequence, regex.BESTMATCH, overlapped=True):
    print('%d: %s' % (m.start(), m.group()))
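To cross-check the positions reported for the example in the question, here is a minimal sketch that does not use the regex module at all. It slides a window over the sequence and counts substitutions (Hamming distance). The function name find_with_substitutions is hypothetical, and the sketch assumes the probe is a literal string rather than a pattern, so it is only a sanity check, not a replacement for the fuzzy regex.

```python
def find_with_substitutions(probe, sequence, max_mismatches):
    """Return (start, window) for every window of len(probe) characters
    that differs from probe by at most max_mismatches substitutions.

    Hypothetical helper: probe is treated as a literal string, not a regex.
    """
    hits = []
    width = len(probe)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Hamming distance: number of positions where the characters differ.
        distance = sum(1 for a, b in zip(probe, window) if a != b)
        if distance <= max_mismatches:
            hits.append((start, window))
    return hits


# Example from the question, with no mismatches allowed.
hits = find_with_substitutions("TTGACAT", "TTGACATTGACATATAAT", 0)
```

Because every start position is tested independently, the two occurrences that share the 'T' at index 6 are both reported, at positions 0 and 6.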
As an aside: the limit of overlapping results
If I use these three parameters:
probe = "ACTG.*ACTG"
sequence = "ACTGTTGACATTGAACTGCATATAATACTG"
mismatches = 0
I will find only two results: ['ACTGTTGACATTGAACTGCATATAATACTG', 'ACTGCATATAATACTG']
instead of three, because two results cannot start at the same position in the string. The third candidate, 'ACTGTTGACATTGAACTG', also starts at position 0, so it is not reported.
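If all three results are needed, one possible workaround is to enumerate the start and end positions yourself instead of relying on a single pattern. The sketch below uses the standard re module rather than regex, handles only the exact case (mismatches = 0), and assumes the probe has the shape flank + ".*" + flank. The helper name all_spans and its flank parameter are hypothetical.

```python
import re


def all_spans(sequence, flank="ACTG"):
    """Return (start, substring) for every substring that begins and ends
    with a copy of flank, where the two copies do not share characters.

    Hypothetical helper: exact matching only, no substitutions.
    """
    # A lookahead consumes no characters, so every occurrence of flank
    # is reported, even if occurrences overlap each other.
    pattern = "(?=" + re.escape(flank) + ")"
    starts = [m.start() for m in re.finditer(pattern, sequence)]
    results = []
    for i, left in enumerate(starts):
        for right in starts[i + 1:]:
            # As in "ACTG.*ACTG", the closing flank must begin after
            # the opening flank has ended.
            if right >= left + len(flank):
                results.append((left, sequence[left:right + len(flank)]))
    return results


spans = all_spans("ACTGTTGACATTGAACTGCATATAATACTG")
```

For the sequence above, spans holds three entries: two starting at position 0 and one starting at position 14.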