Search code examples
pythonalignment

Needleman-wunsch algorithm for two sequences of different length


I wrote a code to align two sequences with the Needleman-Wunsch algorithm. it is going well for two sequences the same length. When I use two sequences of different lengths, I get an error: string index out of range. Could anyone help me?

Thats is my code:

import numpy as np 
s1='ACGGAAGGACAGGGAGATAGCGCGGGGAACGAGGAAGAGGGGGGATTCCAGGGAAGGGTAAT' 
s2='TCCGTATAATAGTGCTGTACTAAGCAAATTTATAGTTCTCTAGAAAGTGCCCGC'


M=len(s1)
N=len(s2)
match=+2
gap=-1
mismatch=-2
matrix=np.zeros((N+1,M+1))

#builduing the score matrix 

# 1. putting horizontal and vertical gaps in columns and rows 
for i in range(1,N+1):
    matrix[i,0]=matrix[i-1,0]+gap
for j  in range(1,M+1):
    matrix[0,j]=matrix[j-1,0]+gap
    
# 2. putting in the score/values in the score matrix 
    
for i in range(1,N+1):
       for j in range(1,M+1):
           if s1[i-1] == s2[j-1]:
               score1 = matrix[i-1,j-1] + match
           else:
               score1 = matrix[i-1,j-1] + mismatch
           score2 = matrix[i,j-1] + gap
           score3 = matrix[i-1,j] + gap
           matrix[i,j] = max(score1,score2,score3)
#create a traceback matrix with D=diagonal, V=vertikal, H=horizontal 
trace_mat=np.zeros((N+1,M+1),dtype=str)
for i in range(1,N+1):
    trace_mat[i,0]= 'V'
for j  in range(1,M+1):
    trace_mat[0,j]='H'

I would appreciate your help !!!





Solution

  • The updated code is given

    import numpy as np 
    s1='ACGGAAGGACAGGGAGATAGCGCGGGGAACGAGGAAGAGGGGGGATTCCAGGGAAGGGTAAT' 
    s2='TCCGTATAATAGTGCTGTACTAAGCAAATTTATAGTTCTCTAGAAAGTGCCCGC'
    
    N=len(s1)
    M=len(s2)
    match=+2
    gap=-1
    mismatch=-2
    matrix=np.zeros((N+1,M+1))
    
    #builduing the score matrix 
    
    # 1. putting horizontal and vertical gaps in columns and rows 
    for i in range(1,N+1):
        matrix[i,0]=matrix[i-1,0]+gap
    for j  in range(1,M+1):
        matrix[0,j]=matrix[0,j-1]+gap
        
    # 2. putting in the score/values in the score matrix 
        
    for i in range(1,N+1):
           for j in range(1,M+1):
               if s1[i-1] == s2[j-1]:
                   score1 = matrix[i-1,j-1] + match
               else:
                   score1 = matrix[i-1,j-1] + mismatch
               score2 = matrix[i,j-1] + gap
               score3 = matrix[i-1,j] + gap
               matrix[i,j] = max(score1,score2,score3)
    #create a traceback matrix with D=diagonal, V=vertikal, H=horizontal 
    trace_mat=np.zeros((N+1,M+1),dtype=str)
    for i in range(1,N+1):
        trace_mat[i,0]= 'V'
    for j  in range(1,M+1):
        trace_mat[0,j]='H'
    

    This should work fine if everything else you did is correct. However, I would like to point out what you did wrong:

    1. The line 'M=len(s1)', 'N=len(s2)' is the one of the error that i pointed. This states that N points to the second string and M points to the first string. Therefore, the matrix is defined as matrix[len(s2)][len(s2)]. However, in the nested for loop, you used 'i' as the row of the matrix to define the s1 string, and j as the column as the matrix to define string s2. But, this is not correct as you defined s1 to be the row of the matrix.

    2. If I state s1 as the row of the matrix and s2 as the column of the matrix, then the following loop is to be re-considered,

        for j in range(1,M+1):
            matrix[0,j]=matrix[j-1,0]+gap
    

    I believe this should not be done like this, and only fixing the error as stated in point 1 solves the string index out of range. However, I believe this loop segment should be written as follows

        for j  in range(1,M+1):
            matrix[0,j]=matrix[0,j-1]+gap
    

    The modification of the loop initializes the starting vertical values of the matrix. Assuming the dynamic programming approach, I believe this should be also fixed.