Search code examples
fastasamtools

How to locate the position of non-missing sequence in the fasta


I have a fasta file including about 1,000 sequences

each sequence have head and tail with missing replaced with 'N' like this

>CHR1
NNNNNAAAGAGAGAGNNTTTAGAGAGGGACNNNNNN

I want to get the start and end position of the target sequence (if there are N in the middle of target sequence, is OK)

for my example with 36 nucleotide, the start and the end position of the target sequence is CHR1:6-30 (because there are 5 N in the head and 6 N in the tail)

which means I want to know the position of the first nucleotide which is not N from both ends of the full sequence

does anyone know how to make it for all of the sequences in the same fasta file?

thanks a lot


Solution

  • One way using :

    from Bio import SeqIO
    
    
    for rec in SeqIO.parse("file.fa", "fasta"):
        seq_length = len(rec.seq)
    
        leading_ns = seq_length - len(rec.seq.lstrip('N'))
        lagging_ns = seq_length - len(rec.seq.rstrip('N'))
    
        start = leading_ns + 1
        stop = seq_length - lagging_ns
    
        print(f'{rec.id}:{start}-{stop}')