I want to determine the length of each individual sequence in a multi-FASTA file. I got this Biopython code from the Biopython manual:
from Bio import SeqIO
import sys
cmdargs = str(sys.argv)
for seq_record in SeqIO.parse(str(sys.argv[1]), "fasta"):
    output_line = '%s\t%i' % \
        (seq_record.id, len(seq_record))
    print(output_line)
My input file looks like this:
>Protein1
MNT
>Protein2
TSMN
>Protein3
TTQRT
And the code yields:
Protein1 3
Protein2 4
Protein3 5
But I want each sequence's start and end position, counted on from the combined length of the previous sequences. The output would look like:
Protein1 1-3
Protein2 4-7
Protein3 8-12
I don't know which line of the code above I need to change to get that output. I'd appreciate any help on this issue, thanks!
Getting just the running total length is easy:
from Bio import SeqIO
import sys

total_len = 0  # combined length of all sequences read so far
for seq_record in SeqIO.parse(sys.argv[1], "fasta"):
    total_len += len(seq_record)
    output_line = '%s\t%i' % (seq_record.id, total_len)
    print(output_line)
To get a range, also remember the total from before the current sequence was added:
from Bio import SeqIO
import sys

total_len = 0  # combined length of all sequences read so far
for seq_record in SeqIO.parse(sys.argv[1], "fasta"):
    previous_total_len = total_len  # end position of the previous sequence
    total_len += len(seq_record)    # end position of this sequence
    # 1-based, inclusive range, e.g. "Protein2<TAB>4-7"
    output_line = '%s\t%i-%i' % (seq_record.id, previous_total_len + 1, total_len)
    print(output_line)
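If you want to check the arithmetic without a FASTA file on hand, the same running-total idea can be sketched as a small generator. This is only an illustration: `cumulative_ranges` is a name I made up (it is not part of Biopython), and it assumes the records arrive as plain (name, sequence) pairs.

```python
def cumulative_ranges(records):
    """Yield (name, start, end) with 1-based, inclusive, cumulative positions.

    `records` is assumed to be an iterable of (name, sequence) pairs.
    """
    end = 0
    for name, sequence in records:
        start = end + 1        # first position after the previous sequence
        end += len(sequence)   # last position of this sequence
        yield name, start, end


# The example input from the question, written as (name, sequence) pairs.
example = [("Protein1", "MNT"), ("Protein2", "TSMN"), ("Protein3", "TTQRT")]

for name, start, end in cumulative_ranges(example):
    print("%s\t%i-%i" % (name, start, end))
```

With Biopython you could presumably feed it `((r.id, r.seq) for r in SeqIO.parse(sys.argv[1], "fasta"))`. Note that an empty sequence would give a start one greater than its end, so you may want to handle that case separately.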