Search code examples
pythonbioinformaticsfasta

How to change the coordinates format according to BED file format using Python?


I have two fasta file that I would like to match shorter sequences which is in FileB.fasta to original sequences which is in FileA.fasta to obtain its coordinates or locations. But my output is not in correct format. Can anyone help me?

FileA.fasta

>chr1:2000-2019
ACGTCGATCGGTCGACGTGC

FileB.fasta

>chr1:2000-2019
GATCGG

FileC.bed

chr1:2000-2019 6 11

Code

from Bio import SeqIO
output_file = open('fileC.bed','w')
for long_sequence_record in SeqIO.parse(open('fileA.fasta'), 'fasta'):
    long_sequence = str(long_sequence_record.seq)
    for short_sequence_record in SeqIO.parse(open('fileB.fasta'), 'fasta'):
        short_sequence = str(short_sequence_record.seq)
        if short_sequence in long_sequence:
            start = long_sequence.index(short_sequence) + 1
            stop = start + len(short_sequence) - 1
           # print short_sequence_record.id, start, stop
            output_line ='%s\t%i\t%i\n' % \
            (short_sequence_record.id,start,stop)
            output_file.write(output_line )
output_file.close()

Desired Output for FileC.bed

chr1 2005 2011

Solution

  • Well, you are adding a 1 to the index at which are finding the shorter sequence -

    start = long_sequence.index(short_sequence) + 1 <--- notice the +1
    

    Don't do that and it should be fine. Also do not do -1 for the stop variable.

    You should instead add the starting sequence number from the id.

    Example -

    start = long_sequence.index(short_sequence) + int((short_sequence_record.id.split(':')[1].split('-')[0]))
    stop = start + len(short_sequence)
    

    For the id of the record , if you do not want anything before the : then you should split the id at : and take the left part (0 index string after split).

    Example -

    output_line ='%s\t%i\t%i\n' % \
            ((short_sequence_record.id.split(':')[0]),start,stop)
            output_file.write(output_line )