I have two fasta file that I would like to match shorter sequences which is in FileB.fasta to original sequences which is in FileA.fasta to obtain its coordinates or locations. But my output is not in correct format. Can anyone help me?
FileA.fasta
>chr1:2000-2019
ACGTCGATCGGTCGACGTGC
FileB.fasta
>chr1:2000-2019
GATCGG
FileC.bed
chr1:2000-2019 6 11
Code
from Bio import SeqIO
output_file = open('fileC.bed','w')
for long_sequence_record in SeqIO.parse(open('fileA.fasta'), 'fasta'):
long_sequence = str(long_sequence_record.seq)
for short_sequence_record in SeqIO.parse(open('fileB.fasta'), 'fasta'):
short_sequence = str(short_sequence_record.seq)
if short_sequence in long_sequence:
start = long_sequence.index(short_sequence) + 1
stop = start + len(short_sequence) - 1
# print short_sequence_record.id, start, stop
output_line ='%s\t%i\t%i\n' % \
(short_sequence_record.id,start,stop)
output_file.write(output_line )
output_file.close()
Desired Output for FileC.bed
chr1 2005 2011
Well, you are adding a 1 to the index at which are finding the shorter sequence -
start = long_sequence.index(short_sequence) + 1 <--- notice the +1
Don't do that and it should be fine. Also do not do -1
for the stop
variable.
You should instead add the starting sequence number from the id.
Example -
start = long_sequence.index(short_sequence) + int((short_sequence_record.id.split(':')[1].split('-')[0]))
stop = start + len(short_sequence)
For the id
of the record , if you do not want anything before the :
then you should split the id at :
and take the left part (0 index string after split).
Example -
output_line ='%s\t%i\t%i\n' % \
((short_sequence_record.id.split(':')[0]),start,stop)
output_file.write(output_line )