Search code examples
pythonpython-2.7bioinformaticsbiopythonfasta

Python: How to find coordinates of short sequences in a FASTA file?


I have a list of short sequences that I want to obtain its coordinate or in another word to get its bed file after compare with a fasta file which contains original sequences.

Fasta file:

>PGH2
CGTAGCGGCTGAGTGCGCGGATAGCGCGTA

Short sequence fasta file:

>PGH2
CGGCTGAGT

Is there any ways to obtain its coordinate? Bedtools can't help much.

Desired output:

PGH2  6 14

Solution

  • Using BioPyton

    from Bio import SeqIO
    
    for long_sequence_record in SeqIO.parse(open('long_sequences.fasta'), 'fasta'):
        long_sequence = str(long_sequence_record.seq)
    
        for short_sequence_record in SeqIO.parse(open('short_sequences.fasta'), 'fasta'):
            short_sequence = str(short_sequence_record.seq)
    
            if short_sequence in long_sequence:
                start = long_sequence.index(short_sequence) + 1
                stop = start + len(short_sequence) - 1
                print short_sequence_record.id, start, stop