Search code examples
pythonfor-loopbiopythonfasta

Python: How to compare multiple sequences from a fasta file with each other?


I'm quite new to the programming world of python and I am trying to write a script that, given a FASTA file, will compare the sequences with each other and score them(If the position of the nucleotide in sequence A matches with the nucleotide in the same position of sequence B, then the score would go up 2 for example). Thus far I got this:

from Bio import SeqIO


def sequence_compare(file):
    seq_records = SeqIO.parse(file, "fasta") 
    for record in seq_records:
        len1 = len(str(record.seq))
        sequence1 = str(record.seq)
        print(sequence1)
        for record in seq_records:
            len2= len(str(record.seq))
            sequence2 = str(record.seq)
            print(sequence2)
            a = 0
            for pos in range (0,min(len1,len2)) :
                if sequence1[pos] == sequence2[pos]:
                    a+= 2
                if sequence1[pos] != sequence2[pos]:
                    a+= -1
                if sequence1[pos] == sequence2[pos] == '-':
                    a+= -2
            print(a)

The output it gives for a Fasta file with 3 sequences in it:

ACTGACTGACTGACTGACTG
ACTGACTGACTG-ACTGACT
16
AAAAAAAAAAAAAAAAAAAAAAAAAAAAA
-5

It seems to me that the first for loop only loops once and the second for loop doesn't start with the first sequence.

The desired output would be that each sequence is compared to each other and scored. So sequence 1 compared with sequence 1 and its score, sequence 1 compared with sequence 2 and its score, sequence 1 compared with sequence 3 and its score, etc...

If somebody could help me, that would be greatly appreciated!


Solution

  • The reason your code did not work is that you were using the same loop variable record for both the inner and outer loops. You may want to change it to record1 and record2 respectively.

    But better still, python provides support to do pairwise combinations in itertools.combinations(), so you can avoid nested loops.

    Also, it would be better to move your scoring algorithms into a separate function.

    With the above two changes in mind, here is the code:

    from Bio import SeqIO
    from itertools import combinations
    
    
    def score(sequence1, sequence2):
        a = 0
        for pos in range(0, min(len(sequence1), len(sequence2))):
            if sequence1[pos] == sequence2[pos]:
                a += 2
            if sequence1[pos] != sequence2[pos]:
                a += -1
            if sequence1[pos] == sequence2[pos] == '-':
                a += -2
        return a
    
    
    def sequence_compare(file):
        seq_records = SeqIO.parse(file, "fasta")
        for record1, record2 in combinations(seq_records, 2):
            sequence1 = record1.seq
            sequence2 = record2.seq
            a = score(sequence1, sequence2)
            print(sequence1)
            print(sequence2)
            print(a)