I have a list of short sequences that I want to obtain its coordinate or in another word to get its bed file after compare with a fasta file which contains original sequences.
Fasta file:
>PGH2
CGTAGCGGCTGAGTGCGCGGATAGCGCGTA
Short sequence fasta file:
>PGH2
CGGCTGAGT
Is there any ways to obtain its coordinate? Bedtools can't help much.
Desired output:
PGH2 6 14
Using BioPyton
from Bio import SeqIO
for long_sequence_record in SeqIO.parse(open('long_sequences.fasta'), 'fasta'):
long_sequence = str(long_sequence_record.seq)
for short_sequence_record in SeqIO.parse(open('short_sequences.fasta'), 'fasta'):
short_sequence = str(short_sequence_record.seq)
if short_sequence in long_sequence:
start = long_sequence.index(short_sequence) + 1
stop = start + len(short_sequence) - 1
print short_sequence_record.id, start, stop