Tags: python, biopython, fasta, ncbi

How can I fetch the FASTA protein sequences for multiple accession numbers from NCBI in Python?


I'm having difficulty downloading FASTA sequences for multiple accession numbers listed in a text file using a Python script. I can do this OK for a single accession number, e.g.:

import sys
from Bio import Entrez
Entrez.email = "your.name@example.com"  # placeholder address
handle = Entrez.efetch(db="protein", id="EAS03220", rettype="fasta")
print(handle.read())

But when I read the accession numbers from a file into a list (see below), I get errors.

import sys
from Bio import Entrez
Entrez.email = "your.name@example.com"  # placeholder address

accessions = []
for line in open(sys.argv[1],"r"):
    line = line.strip()
    accessions.append(line)

for num in accessions:
    handle = Entrez.efetch(db="protein", id="num", rettype="fasta")
    print(handle.read())

Here's an example of how my input file looks:

EAS06781
EAS07087
EAS07113
EAS07200
EAS07226
EAS07230

I'm sure the solution is easy, but I've been reading forums, NCBI help pages and Python-for-beginners books for hours and getting nowhere! Thanks in advance.


Solution

  • You are passing "num" as a string literal, not the variable num, so every request asks NCBI for a record whose ID is literally "num". Remove the quotation marks and it should work:

    handle = Entrez.efetch(db="protein", id=num, rettype="fasta")
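
For reference, here is a minimal sketch of the whole script with that fix applied. The function names read_accessions and fetch_fasta, the script name fetch_fasta.py and the placeholder e-mail address are my own choices (assumptions), not part of the original code. It also differs from the loop above in one respect: it sends all IDs in a single efetch call as a comma-separated string instead of one request per ID, which is probably kinder to NCBI's servers but may need batching for very long lists.

```python
import sys


def read_accessions(path):
    """Return the accession numbers in a text file, one per line, skipping blanks."""
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]


def fetch_fasta(accessions, email):
    """Fetch FASTA text for all accessions in one efetch request."""
    # Imported here so read_accessions can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    # Pass the actual values, not the string "num"; IDs are joined with commas.
    handle = Entrez.efetch(db="protein", id=",".join(accessions),
                           rettype="fasta", retmode="text")
    try:
        return handle.read()
    finally:
        handle.close()


def main(argv):
    if len(argv) != 2:
        print("usage: python fetch_fasta.py accessions.txt", file=sys.stderr)
        return 1
    accessions = read_accessions(argv[1])
    # Placeholder address (an assumption); replace it with your own.
    print(fetch_fasta(accessions, email="your.name@example.com"))
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

You would run it as python fetch_fasta.py accessions.txt, where accessions.txt looks like the input file in the question. Stripping and skipping blank lines in read_accessions also guards against a trailing empty line in the file being sent as an empty ID.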