python biopython fasta

Calculate the length of a sequence after adding the length of previous sequences


I want to determine the length of each individual sequence in a multi-FASTA file. I got this Biopython code from the Biopython manual:

from Bio import SeqIO
import sys
cmdargs = str(sys.argv)
for seq_record in SeqIO.parse(str(sys.argv[1]), "fasta"):
    output_line = '%s\t%i' % (seq_record.id, len(seq_record))
    print(output_line)

My input file is like:

>Protein1
MNT
>Protein2
TSMN
>Protein3
TTQRT

And the code yields:

Protein1        3
Protein2        4
Protein3        5

But I want the position range of each sequence after adding the lengths of the previous sequences, that is, its start and end in cumulative coordinates. It would be like:

Protein1        1-3
Protein2        4-7
Protein3        8-12

I don't know which line of the code above I need to change to get that output. I'd appreciate any help on this issue, thanks!


Solution

  • Getting the running total length is easy: keep a counter outside the loop and add each record's length to it:

    from Bio import SeqIO
    import sys

    total_len = 0  # sum of the lengths of all records seen so far
    for seq_record in SeqIO.parse(sys.argv[1], "fasta"):
        total_len += len(seq_record)
        output_line = '%s\t%i' % (seq_record.id, total_len)
        print(output_line)
    

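    To get a range, also remember the total from before the current record was added; the range starts one position past it and ends at the new total:

    from Bio import SeqIO
    import sys

    total_len = 0  # sum of the lengths of all records seen so far
    for seq_record in SeqIO.parse(sys.argv[1], "fasta"):
        previous_total_len = total_len  # end position of the previous record
        total_len += len(seq_record)
        # 1-based, inclusive range, e.g. "Protein2<TAB>4-7"
        output_line = '%s\t%i-%i' % (seq_record.id, previous_total_len + 1, total_len)
        print(output_line)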
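
The running-total logic can also be separated from the file parsing, which makes it easy to check against the example in the question without a FASTA file. This is only a minimal sketch: the helper name `cumulative_ranges` is hypothetical (it is not part of Biopython), and it assumes the input is an iterable of (name, length) pairs.

```python
def cumulative_ranges(lengths):
    """Yield (name, start, end) in 1-based, inclusive cumulative coordinates.

    `lengths` is any iterable of (name, length) pairs.
    """
    end = 0  # end position of the previous record (0 before the first one)
    for name, length in lengths:
        start = end + 1
        end += length
        yield name, start, end


# The three records from the question, written as (id, length) pairs.
example = [("Protein1", 3), ("Protein2", 4), ("Protein3", 5)]
for name, start, end in cumulative_ranges(example):
    print("%s\t%i-%i" % (name, start, end))
```

With Biopython, the pairs could be produced by a generator such as `((rec.id, len(rec)) for rec in SeqIO.parse(sys.argv[1], "fasta"))`, so the file is still read one record at a time.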