Search code examples
pythonbioinformaticsregular-languagefasta

How to retrieve FASTA sequences according to coordinate information using Python?


I would like to obtain the sequences according to the bed file B.bed which contain coordinates information of sequences by matching the coordinates to the fasta file which is A.fasta and retrieve the corresponding sequences according to B.bed file. Fasta file is normal file with sequence ID and followed by its nucleotides ATCG. But I get the key error below. Can anyone help me?

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from collections import defaultdict

# read names and postions from bed file
positions = defaultdict(list)
with open('B.bed') as f:
    for line in f:
        name, start, stop = line.split()
        positions[name].append((int(start), int(stop)))

# parse faste file and turn into dictionary
records = SeqIO.to_dict(SeqIO.parse(open('A.fasta'), 'fasta'))

# search for short sequences
short_seq_records = []
for name in positions:
    for (start, stop) in positions[name]:        
        long_seq_record = records[name]
        long_seq = long_seq_record.seq
        alphabet = long_seq.alphabet
        short_seq = str(long_seq)[start-1:stop]        
        short_seq_record =SeqRecord(Seq(name))+ '\t'+str(start)+'\t'+str(stop)+'\t' + SeqRecord(Seq(short_seq, alphabet))
    short_seq_records.append(short_seq_record)

# write to file
with open('C.fasta', 'w') as f:
    SeqIO.write(short_seq_records,f, 'fasta')

A.fasta

>chr16:13361561-13361573
TAGTGGGTCAGAC
>chr6_apd_hap1:2165669-2165681
AGATGAGTCATCA
>chr10:112612173-112612185
AAGTGTGTCAGCT

B.bed

chr6_apd_hap1   2165668 2165681
chr10   112612172   112612185

Expected output for C.fasta

>chr6_apd_hap1:2165669-2165681
AGATGAGTCATCA
>chr10:112612173-112612185
AAGTGTGTCAGCT

But I received error below:

long_seq_record = records[name]
KeyError: 'chr6_apd_hap1'

Solution

  • In your code, positions is a defaultdict which has as keys the names from the BED file:

    >>> print positions.keys()
    ['chr10', 'chr6_apd_hap1']
    

    And records is a dictionary which has as keys the headers of the FASTA file, minus the > at the beginning, but they still include the colon and the position on the chromosome:

    >>> print records.keys()
    ['chr16:13361561-13361573', 'chr6_apd_hap1:2165669-2165681', 'chr10:112612173-112612185']
    

    So you first need to transform your records keys to loose the extra info so that you can use the positions keys to retrieve them. You can do this by adding the following line after you've created the records dictionary:

    records = {key.split(':')[0]: value for (key, value) in records.iteritems()}
    

    Also, the way you're currently constructing your short_seq_record doesn't really work. Replace the lines

    short_seq = str(long_seq)[start-1:stop]        
    short_seq_record =SeqRecord(Seq(name))+ '\t'+str(start)+'\t'+str(stop)+'\t' + SeqRecord(Seq(short_seq, alphabet))
    

    with:

    short_seq = Seq(str(long_seq)[start-1:stop], alphabet)
    short_seq_id = '{0}\t{1}\t{2}'.format(name, start, stop)
    short_seq_record = SeqRecord(short_seq, id=short_seq_id, description='')