I have a fasta file including about 1,000 sequences
each sequence have head and tail with missing replaced with 'N' like this
>CHR1
NNNNNAAAGAGAGAGNNTTTAGAGAGGGACNNNNNN
I want to get the start and end position of the target sequence (if there are N in the middle of target sequence, is OK)
for my example with 36 nucleotide, the start and the end position of the target sequence is CHR1:6-30 (because there are 5 N in the head and 6 N in the tail)
which means I want to know the position of the first nucleotide which is not N from both ends of the full sequence
does anyone know how to make it for all of the sequences in the same fasta file?
thanks a lot
One way using biopython:
from Bio import SeqIO
for rec in SeqIO.parse("file.fa", "fasta"):
seq_length = len(rec.seq)
leading_ns = seq_length - len(rec.seq.lstrip('N'))
lagging_ns = seq_length - len(rec.seq.rstrip('N'))
start = leading_ns + 1
stop = seq_length - lagging_ns
print(f'{rec.id}:{start}-{stop}')