I wrote a code to align two sequences with the Needleman-Wunsch algorithm. it is going well for two sequences the same length. When I use two sequences of different lengths, I get an error: string index out of range
. Could anyone help me?
Thats is my code:
import numpy as np
s1='ACGGAAGGACAGGGAGATAGCGCGGGGAACGAGGAAGAGGGGGGATTCCAGGGAAGGGTAAT'
s2='TCCGTATAATAGTGCTGTACTAAGCAAATTTATAGTTCTCTAGAAAGTGCCCGC'
M=len(s1)
N=len(s2)
match=+2
gap=-1
mismatch=-2
matrix=np.zeros((N+1,M+1))
#builduing the score matrix
# 1. putting horizontal and vertical gaps in columns and rows
for i in range(1,N+1):
matrix[i,0]=matrix[i-1,0]+gap
for j in range(1,M+1):
matrix[0,j]=matrix[j-1,0]+gap
# 2. putting in the score/values in the score matrix
for i in range(1,N+1):
for j in range(1,M+1):
if s1[i-1] == s2[j-1]:
score1 = matrix[i-1,j-1] + match
else:
score1 = matrix[i-1,j-1] + mismatch
score2 = matrix[i,j-1] + gap
score3 = matrix[i-1,j] + gap
matrix[i,j] = max(score1,score2,score3)
#create a traceback matrix with D=diagonal, V=vertikal, H=horizontal
trace_mat=np.zeros((N+1,M+1),dtype=str)
for i in range(1,N+1):
trace_mat[i,0]= 'V'
for j in range(1,M+1):
trace_mat[0,j]='H'
I would appreciate your help !!!
The updated code is given
import numpy as np
s1='ACGGAAGGACAGGGAGATAGCGCGGGGAACGAGGAAGAGGGGGGATTCCAGGGAAGGGTAAT'
s2='TCCGTATAATAGTGCTGTACTAAGCAAATTTATAGTTCTCTAGAAAGTGCCCGC'
N=len(s1)
M=len(s2)
match=+2
gap=-1
mismatch=-2
matrix=np.zeros((N+1,M+1))
#builduing the score matrix
# 1. putting horizontal and vertical gaps in columns and rows
for i in range(1,N+1):
matrix[i,0]=matrix[i-1,0]+gap
for j in range(1,M+1):
matrix[0,j]=matrix[0,j-1]+gap
# 2. putting in the score/values in the score matrix
for i in range(1,N+1):
for j in range(1,M+1):
if s1[i-1] == s2[j-1]:
score1 = matrix[i-1,j-1] + match
else:
score1 = matrix[i-1,j-1] + mismatch
score2 = matrix[i,j-1] + gap
score3 = matrix[i-1,j] + gap
matrix[i,j] = max(score1,score2,score3)
#create a traceback matrix with D=diagonal, V=vertikal, H=horizontal
trace_mat=np.zeros((N+1,M+1),dtype=str)
for i in range(1,N+1):
trace_mat[i,0]= 'V'
for j in range(1,M+1):
trace_mat[0,j]='H'
This should work fine if everything else you did is correct. However, I would like to point out what you did wrong:
The line 'M=len(s1)', 'N=len(s2)' is the one of the error that i pointed. This states that N points to the second string and M points to the first string. Therefore, the matrix is defined as matrix[len(s2)][len(s2)]
. However, in the nested for loop, you used 'i' as the row of the matrix to define the s1
string, and j as the column as the matrix to define string s2
. But, this is not correct as you defined s1
to be the row of the matrix.
If I state s1
as the row of the matrix and s2
as the column of the matrix, then the following loop is to be re-considered,
for j in range(1,M+1):
matrix[0,j]=matrix[j-1,0]+gap
I believe this should not be done like this, and only fixing the error as stated in point 1 solves the string index out of range
. However, I believe this loop segment should be written as follows
for j in range(1,M+1):
matrix[0,j]=matrix[0,j-1]+gap
The modification of the loop initializes the starting vertical values of the matrix. Assuming the dynamic programming approach, I believe this should be also fixed.