I am looking for a way to find the total number of mismatches between two strings in Python. My input is a list that looks like this:
['sequence=AGATGG', 'sequence=AGCTAG', 'sequence=TGCTAG',
'sequence=AGGTAG', 'sequence=AGCTAG', 'sequence=AGAGAG']
For each string, I want to see how many differences it has from the sequence "sequence=AGATAA". So if the input were element [0] from the list above, the output would read like this:
sequence=AGATGG, 2
I cannot figure out whether to split each of the letters into individual lists or if I should try and compare the whole string somehow. Any help is useful, thanks
First of all, I think your safest bet is to use Levenshtein distance from a library. But since you tagged the question with Biopython, you can use pairwise2:
First you want to get rid of the "sequence=" prefix. You can slice each string or split it on "=":
seqs = [x.split("=")[1] for x in ['sequence=AGATGG',
'sequence=AGCTAG',
'sequence=TGCTAG',
'sequence=AGGTAG',
'sequence=AGCTAG',
'sequence=AGAGAG']]
Now define the reference sequence:
ref_seq = "AGATAA"
And using pairwise2 you can calculate the alignment:
from Bio import pairwise2
for seq in seqs:
    # print() with parentheses works on both Python 2 and Python 3
    print(pairwise2.align.globalxx(ref_seq, seq))
I'm using pairwise2.align.globalxx, which is a global alignment that takes no scoring parameters: each match scores 1, and mismatches and gaps carry no penalty. Other functions accept different values for matches and gaps. Check them at http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html.
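If all the sequences have the same length and you only care about substitutions (no insertions or deletions), a plain position-by-position comparison may be enough, with no alignment at all. Here is a minimal sketch under that assumption. The helper name count_mismatches and the variable records are made up for illustration, and this computes a Hamming distance rather than the Levenshtein distance or the alignment score mentioned above:

```python
def count_mismatches(ref, seq):
    """Count the positions at which two equal-length strings differ."""
    if len(ref) != len(seq):
        # Assumption: sequences of unequal length need a real alignment instead.
        raise ValueError("sequences must have the same length")
    # zip pairs up the characters; each differing pair contributes 1 to the sum.
    return sum(a != b for a, b in zip(ref, seq))


records = ['sequence=AGATGG', 'sequence=AGCTAG', 'sequence=TGCTAG',
           'sequence=AGGTAG', 'sequence=AGCTAG', 'sequence=AGAGAG']
ref_seq = "AGATAA"

for record in records:
    seq = record.split("=", 1)[1]  # drop the "sequence=" prefix
    print("{}, {}".format(record, count_mismatches(ref_seq, seq)))
```

For the first element this prints "sequence=AGATGG, 2", which is the format asked for in the question. If the sequences can differ in length, fall back to an alignment or an edit distance.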