I would like to obtain the sequences according to the bed file B.bed which contain coordinates information of sequences by matching the coordinates to the fasta file which is A.fasta and retrieve the corresponding sequences according to B.bed file. Fasta file is normal file with sequence ID and followed by its nucleotides ATCG. But I get the key error below. Can anyone help me?
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from collections import defaultdict
# read names and postions from bed file
positions = defaultdict(list)
with open('B.bed') as f:
for line in f:
name, start, stop = line.split()
positions[name].append((int(start), int(stop)))
# parse faste file and turn into dictionary
records = SeqIO.to_dict(SeqIO.parse(open('A.fasta'), 'fasta'))
# search for short sequences
short_seq_records = []
for name in positions:
for (start, stop) in positions[name]:
long_seq_record = records[name]
long_seq = long_seq_record.seq
alphabet = long_seq.alphabet
short_seq = str(long_seq)[start-1:stop]
short_seq_record =SeqRecord(Seq(name))+ '\t'+str(start)+'\t'+str(stop)+'\t' + SeqRecord(Seq(short_seq, alphabet))
short_seq_records.append(short_seq_record)
# write to file
with open('C.fasta', 'w') as f:
SeqIO.write(short_seq_records,f, 'fasta')
A.fasta
>chr16:13361561-13361573
TAGTGGGTCAGAC
>chr6_apd_hap1:2165669-2165681
AGATGAGTCATCA
>chr10:112612173-112612185
AAGTGTGTCAGCT
B.bed
chr6_apd_hap1 2165668 2165681
chr10 112612172 112612185
Expected output for C.fasta
>chr6_apd_hap1:2165669-2165681
AGATGAGTCATCA
>chr10:112612173-112612185
AAGTGTGTCAGCT
But I received error below:
long_seq_record = records[name]
KeyError: 'chr6_apd_hap1'
In your code, positions
is a defaultdict
which has as keys the names from the BED file:
>>> print positions.keys()
['chr10', 'chr6_apd_hap1']
And records
is a dictionary which has as keys the headers of the FASTA file, minus the >
at the beginning, but they still include the colon and the position on the chromosome:
>>> print records.keys()
['chr16:13361561-13361573', 'chr6_apd_hap1:2165669-2165681', 'chr10:112612173-112612185']
So you first need to transform your records
keys to loose the extra info so that you can use the positions
keys to retrieve them. You can do this by adding the following line after you've created the records
dictionary:
records = {key.split(':')[0]: value for (key, value) in records.iteritems()}
Also, the way you're currently constructing your short_seq_record
doesn't really work. Replace the lines
short_seq = str(long_seq)[start-1:stop]
short_seq_record =SeqRecord(Seq(name))+ '\t'+str(start)+'\t'+str(stop)+'\t' + SeqRecord(Seq(short_seq, alphabet))
with:
short_seq = Seq(str(long_seq)[start-1:stop], alphabet)
short_seq_id = '{0}\t{1}\t{2}'.format(name, start, stop)
short_seq_record = SeqRecord(short_seq, id=short_seq_id, description='')