Search code examples
algorithmsequence-alignment

Sequence alignment with minimum subsequence length constraint


How can I implement sequence alignment with minimum subsequence length constraint. For example let for these inputs minimum sub-sequence length be 3. Using Smith-Waterman gives output like below.

ATACGGACC
||    |||
ATCATAACC

But instead I need like below.

---ATACGGACC
   |||   |||
ATCATA---ACC

Is there a know algorithm for this, or do you know an approach?

Thanks in advance.


Solution

  • Look at how Smith-Waterman works. You have two sequences (S1 of length N, S2 of length M) and you create an NxM matrix (let's call it A) where A(i,j) is "the best score alignment of S1[1..i] and S2[1..j]." To do that, you write a recurrance about how to get A(i,j) based on the last element - A(i,j) is the best of

    A(i-1, j-1) + match/mismatch score
    A(i,j-1) + indel score
    A(i-1,j) + indel score
    

    Here's the basic idea; you may need to tweak it a little.

    To do what you're asking, you need two matrices. Let A(i,j) be "the best score alignment of S1[1..i] and S2[1..j] ending with a match" and B(i,j) be "the best score alignment of S1[1..i] and S2[1..j] ending with an indel."

    Let's start with B, because it's easier. B(i,j) is the best of

    A(i,j-1) + indel score
    A(i-1,j) + indel score
    B(i,j-1) + indel score
    B(i-1,j) + indel score
    

    And A(i,j) is the best of

    A(i-1, j-1) + match/mismatch score
    B(i-3, j-3) + match/mismatch score for the three