The implementation in skbio gives an odd result compared to the result that you would get from the implementation in pycogent.
from cogent.align.algorithm import nw_align as nw_align_cogent
from skbio.alignment import global_pairwise_align_nucleotide as nw_align_scikit
seq_1 = 'ATCGATCGATCG'
seq_2 = 'ATCGATATCGATCG'
print "Sequences: "
print " %s" % seq_1
print " %s" % seq_2
print
alignment = nw_align_scikit(seq_1, seq_2)
al_1, al_2 = [alignment.get_seq(_id).__str__() for _id in alignment.ids()]
print " nw alignment using scikit:"
print " %s" % al_1
print " %s" % al_2
print
al_1, al_2 = nw_align_cogent(seq_1, seq_2)
print " nw alignment using cogent:"
print " %s" % al_1
print " %s" % al_2
print
The output looks like this:
nw alignment using scikit:
------ATCGATCGATCG
ATCGATATCGATCG----
nw alignment using cogent:
ATCGAT--CGATCG
ATCGATATCGATCG
This is a good question, and has to do with differences in how alignments are scored in scikit-bio and PyCogent. By default, in scikit-bio, terminal gaps are not penalized as that can lead to some very strange alignments. This issue was briefly discussed here and illustrated here (see the last cell of the notebook).
If you want to achieve a solution more similar to the one in PyCogent, you can pass penalize_terminal_gaps=True
to global_pairwise_align_nucleotide
as follows:
alignment = nw_align_scikit(seq_1, seq_2, penalize_terminal_gaps=True)
al_1, al_2 = [alignment.get_seq(_id).__str__() for _id in alignment.ids()]
print " nw alignment using scikit:"
print " %s" % al_1
print " %s" % al_2
output:
nw alignment using scikit:
ATCG--ATCGATCG
ATCGATATCGATCG
You'll notice that the alignment is still not identical to the one you get from PyCogent, but this is a minor implementation difference: the two resulting alignments have the same score (the difference is in whether the --
aligns to the first AT
or the second AT
in the ATAT
repeat), and the two implementations make a different choice in how they break that tie.
If you go back to the alignment that you posted (the default from scikit-bio), what you'll notice is that the alignment that is returned is very good - in fact, it's the optimally scoring alignment if not penalizing the terminal gaps (by definition, because the optimal scoring alignment is what it returns). However, you're right that it's strange, because the alignment that scikit-bio returns in this specific case is probably not the most biologically relevant alignment. If you know that your sequences all start at the same position and end at the same position, you could pass penalize_terminal_gaps=True
. However, yours is a toy example and in most cases with real sequences, I think the most biologically relevant alignment would be returned with the default parameters.