
Can I convert a defaultdict or dict to an OrderedDict in Python?


I am trying to parse a FASTA file and then write another file containing every possible 100-base window (over the alphabet ATGCN) of each sequence in the FASTA file.

For example:

chr1_1-100:ATGC.....GC  
chr1_2-101:ATGC.....GC  
chr1_3-102:ATGC.....GC  
......................  
chr22_1-100:ATGC....cG  
chr22_2-101:ATGC....cG  
......................

I did it with the following code:

    from Bio import SeqIO
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord

    records = SeqIO.to_dict(SeqIO.parse(open(i1), 'fasta'))
    with open(out, 'w') as f:
        for key in records:
            long_seq_record = records[key]
            long_seq = long_seq_record.seq
            length = len(long_seq)
            # One window per start position; sequences shorter than 100 give none.
            for i in range(0, length - 99):
                short_seq = str(long_seq)[i:i+100]
                # FASTQ-style entry: header, sequence, '+', constant quality string.
                text = "@" + key + "_" + str(i) + "-" + str(i+100) + ":" + "\n" + short_seq + "\n" + "+" + "\n" + "I" * 100 + "\n"
                f.write(text)

The problem is that the written file is not ordered: for example, chr10 can appear before chr2.

This happens because the records are parsed into a dict, i.e. SeqIO.to_dict(SeqIO.parse(open(i1), 'fasta')).

So, can I convert the dict into an ordered dict so that my output file is ordered? Or is there another way to solve this?


Solution

  • Don't bother making any sort of dict at all. You don't need the properties a dict gives you, and you need the information the dict conversion loses. The record iterator from SeqIO.parse already gives you what you need:

    with open(i1) as infile, open(out, 'w') as f:
        for record in SeqIO.parse(infile, 'fasta'):
            # Do what you were going to do with the record.
            ...
    

    If you need the information that was in the dict key, that's record.id.
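
As a rough illustration of the streaming approach, here is a minimal sketch of the windowing step written as a generator. The function name `fastq_windows` and its parameters are hypothetical, not part of Biopython. It takes any iterable of (identifier, sequence) pairs, so in the real script you would presumably feed it `(record.id, str(record.seq))` for each record from `SeqIO.parse`. It also assumes the 1-based, inclusive labels shown in the question's example (chr1_1-100), which differs from the 0-based labels the question's code writes.

```python
def fastq_windows(records, width=100, quality_char="I"):
    """Yield one FASTQ-style entry per sliding window, in input order.

    `records` is any iterable of (identifier, sequence) pairs, e.g.
    ((rec.id, str(rec.seq)) for rec in SeqIO.parse(infile, "fasta")).
    """
    quality = quality_char * width
    for name, sequence in records:
        # Sequences shorter than `width` produce no windows.
        for start in range(len(sequence) - width + 1):
            window = sequence[start:start + width]
            # Assumption: 1-based inclusive coordinates, as in chr1_1-100.
            yield f"@{name}_{start + 1}-{start + width}:\n{window}\n+\n{quality}\n"


# Tiny stand-in data so the sketch runs without a FASTA file.
demo = [("chr10", "ACGTAC"), ("chr2", "TTGCA")]
entries = list(fastq_windows(demo, width=4))
```

Because nothing is collected into a dict, the entries come out in the same order as the records in the input; writing them is then just `f.writelines(fastq_windows(...))` inside the `with` block.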