I am trying to parse a FASTA file and write another file containing every possible 100-base window (over the letters A, T, G, C and N) of each sequence in it.
For example:
chr1_1-100:ATGC.....GC
chr1_2-101:ATGC.....GC
chr1_3-102:ATGC.....GC
......................
chr22_1-100:ATGC....cG
chr22_2-101:ATGC....cG
......................
I did it with the following code:
    from Bio import SeqIO
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord
    records = SeqIO.to_dict(SeqIO.parse(open(i1), 'fasta'))
    with open(out, 'w') as f:
        for key in records:
            long_seq_record = records[key]
            long_seq = long_seq_record.seq
            length=len(long_seq)
            alphabet = long_seq.alphabet
            for i in range(0, length-99):
                short_seq = str(long_seq)[i:i+100]
                text="@"+key+"_"+str(i)+"-"+str(i+100)+":"+"\n"+short_seq+"\n"+"+"+"\n"+"IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII\n"
                f.write(text)
The problem is that the written file is not ordered: for example, it can contain chr10 first and then chr2.
This happens because the parsing is done with a dict, i.e. SeqIO.to_dict(SeqIO.parse(open(i1), 'fasta')).
So, can I convert the dict into an ordered dict so that my output file is ordered? Or is there another way to solve this?
Don't bother making any sort of dict at all. You don't need the properties a dict gives you, and you need the information the dict conversion loses. The record iterator from SeqIO.parse already gives you what you need:
    with open(i1) as infile, open(out, 'w') as f:
        for record in SeqIO.parse(infile, 'fasta'):
            # Do what you were going to do with the record.
If you need the information that was in the dict key, that's record.id.
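For completeness, here is a minimal sketch of how the whole loop might look with the iterator approach. The names iter_windows, write_windows and convert are made up for illustration, and the sketch assumes you want the 1-based, inclusive coordinates shown in the example output (chr1_1-100, chr1_2-101, ...) rather than the 0-based ones the posted code produces. The header and placeholder quality line otherwise follow the posted code.

```python
def iter_windows(seq, size=100):
    """Yield (start, end, window) for every full-length window.

    Coordinates are 1-based and inclusive (an assumption taken from the
    example output in the question).
    """
    seq = str(seq)  # convert once, not once per window
    for i in range(len(seq) - size + 1):
        yield i + 1, i + size, seq[i:i + size]


def write_windows(records, handle, size=100):
    """Write one four-line entry per window, in the order records arrive.

    `records` is any iterable of objects with `.id` and `.seq`,
    e.g. the iterator returned by SeqIO.parse.
    """
    quality = "I" * size  # placeholder quality string, as in the question
    for record in records:
        for start, end, window in iter_windows(record.seq, size):
            handle.write(f"@{record.id}_{start}-{end}:\n{window}\n+\n{quality}\n")


def convert(fasta_path, out_path, size=100):
    """Stream a FASTA file through write_windows (requires Biopython)."""
    # Imported here so the two helpers above can be used without Biopython.
    from Bio import SeqIO

    with open(fasta_path) as infile, open(out_path, "w") as out:
        write_windows(SeqIO.parse(infile, "fasta"), out, size)
```

Calling convert(i1, out) should then write the windows chromosome by chromosome in the same order as the input file, because nothing is ever stored in a dict. Sequences shorter than the window size simply produce no entries.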