When I run the following;
from Bio.Blast import NCBIWWW
from Bio import Entrez, SeqIO
Entrez.email = "[email protected]"
handle = Entrez.efetch(db="Protein", id= "75192198", rettype = "xml")
record = Entrez.read(handle)
I get back a "Bio.Entrez.Parser.DictionaryElement" that is really difficult to search through. If I want to say get the the get the amino acid sequence I have to type something like this;
record["Bioseq-set_seq-set"][0]["Seq-entry_seq"]["Bioseq"]["Bioseq_inst"]["Seq-inst"]["Seq-inst_seq-data"]["Seq-data"]["Seq-data_iupacaa"]["IUPACaa"]
I know that there has to be an easier way to index the elements in these results. If anyone out there can lend me a hand with this I'd appreciate it very much.
If what you want is the sequence, then instead of querying it in "xml" format, query it in (for example) FASTA format, by changing the rettype
argument. Then it's as simple as parsing it using SeqIO.
handle = Entrez.efetch(db="Protein", id= "75192198", rettype = "fasta")
for r in SeqIO.parse(handle, "fasta"):
print r.id, r.seq
This works because the contents of handle
look like:
print handle.read()
# >gi|75192198|sp|Q9MAH8.1|TCP3_ARATH RecName: Full=Transcription factor TCP3
# MAPDNDHFLDSPSPPLLEMRHHQSATENGGGCGEIVEVQGGHIVRSTGRKDRHSKVCTAKGPRDRRVRLS
# APTAIQFYDVQDRLGFDRPSKAVDWLITKAKSAIDDLAQLPPWNPADTLRQHAAAAANAKPRKTKTLISP
# PPPQPEETEHHRIGEEEDNESSFLPASMDSDSIADTIKSFFPVASTQQSYHHQPPSRGNTQNQDLLRLSL
# QSFQNGPPFPNQTEPALFSGQSNNQLAFDSSTASWEQSHQSPEFGKIQRLVSWNNVGAAESAGSTGGFVF
# ASPSSLHPVYSQSQLLSQRGPLQSINTPMIRAWFDPHHHHHHHQQSMTTDDLHHHHPYHIPPGIHQSAIP
# GIAFASSGEFSGFRIPARFQGEQEEHGGDNKPSSASSDSRH
If you still want some of the other meta information (such as transcription factor binding sites within the gene, or the taxonomy of the organism), you can also download it in genbank format by giving the argument rettype="gb"
and parsing with "gb"
. You can learn more about that in the example here.