Search code examples
alignmentlocalbiopythondna-sequencesequence-alignment

Biopython: Local alignment between DNA sequences doesn't find optimal alignment


I'm writing code to find local alignments between two sequences. Here is a minimal, working example I've been working on:

from Bio import pairwise2
from Bio.pairwise2 import format_alignment
seq1 = "GTGGTCCTAGGC"
seq2 = "GCCTAGGACCAC"
# scores for the alignment
match =1
mismatch = -2
gapopen = -2
gapext = 0
# see: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html
# 'localms' takes <seq1,seq2, match,mismatch,open,extend>
for a in pairwise2.align.localms(seq1,seq2,match,mismatch,gapopen,gapext):
    print(format_alignment(*a))

The following code runs with the output

GTGGTCCTAGGC----
      |||||
----GCCTAGGACCAC
  Score=5

But a score of '6' should be possible, finding the 'C-C' next to the 5 alignments, like so:

GTGGTCCTAGGC----
     ||||||
----GCCTAGGACCAC
  Score=6

Any ideas on what's going on?


Solution

  • This seems to be a bug in the current implementation of local alignments in Biopython's pairwise2 module. There is a recent pull request (#782) on Biopython's GitHub, which should solve your problem:

    >>> from Bio import pairwise2  # This is the version from the pull request
    >>> seq1 = 'GTGGTCCTAGGC'
    >>> seq2 = 'GCCTAGGACCAC'
    >>> for a in pairwise2.align.localms(seq1, seq2, 1, -2, -2, 0):
            print pairwise2.format_alignment(*a)
    
    
    GTGGTCCTAGGC----
         ||||||
    ----GCCTAGGACCAC
      Score=6
    
    1. If you are working with short sequences only, you can just download the code for pairwise2.py from the pull request mentioned above. In addition you need to 'inactivate' the respective C module (cpairwise2.pyd or cpairwise2.so), e.g. by renaming it or by removing the import of the C functions at the end of pairwise2.py(from .cpairwise import ...).

    2. If your are working with longer sequences, you will need the speed enhancement of the C module. Thus you also have to download cpairwise2module.c from the pull request and compile it into cpairwise2.pyd (for Windows systems) or cpairwise2.so (Unix, Linux).

    EDIT: In Biopython 1.68 the problem is solved.