Search code examples
pythonbioinformaticsbiopythonfasta

extract several fasta file with a list of ID (in order)


I have a file with several names such :

seq1 seq9 seq3 seq7 seq5 seqi seqn....

and another fasta file with all my sequences, and what I need to do is to order my sequences by the order of the list above:such as:

>seq1
aaaaa
>seq9
aaaaa
>seq3
aaaaa
>seq7
aaaaa
>seq5
aaaaa
...

I tried this:

input_file = open('concatenate_0035_0042_aa2.fa','r')
output_file = open('result.fasta','a')


liste=['seq1','seq5','seq8' etc]
print(len(liste))
compteur=1
for i in liste:
    record_dict = SeqIO.to_dict(SeqIO.parse("concatenate_0035_0042_aa2.fa", "fasta"))
    print(">",record_dict[i].id,file=output_file,sep="")
    print(record_dict[i].seq,file=output_file)
    compteur+=1
    print(compteur)

output_file.close()
input_file.close()

but it actually takes too much time.


Solution

  • The reason that your current code takes too much time, is because for each sequence id in your list, you parse your fasta file and convert it to a dict. Certainly if your fasta file is large, this is an expensive computation. So only do it once:

    from Bio import SeqIO
    
    ids = ['seq1', 'seq9', 'seq3', 'seq7', 'seq5'] 
    with open('concatenate_0035_0042_aa2.fa') as seqs, open('result.fasta', 'w') as result:
        record_dict = SeqIO.to_dict(SeqIO.parse(seqs, 'fasta'))
        result_records = [record_dict[id_] for id_ in ids]
        SeqIO.write(result_records, result, "fasta")
    

    the with open(...) statement takes care of automatically closing the file for you when you're done.