xml sequence bioinformatics biopython ncbi

how to get a specific protein sequence using entrez.efetch?

I am trying to get the protein sequence from NCBI via a gene id (GI) number, using Biopython's Entrez.fetch() function.

proteina = Entrez.efetch(db="protein", id= gi, rettype="gb", retmode="xml").

I then read the data using:

proteinaXML = Entrez.read(proteina).

I can print the results, however I don't know how to get the protein sequence alone.

I can reach the protein manually once the result is displayed. Or I I check the XML tree using:

proteinaXML[0]["GBSeq_feature-table"][2]["GBFeature_quals"][6]['GBQualifier_value'].

However, depending on the GI of the protein submitted, the XML tree can differ. Making it difficult to automate this process robustly.

My question: Is it possible to retrieve only the protein sequence, and not the entire XML tree? Or alternatively: How can I extract the protein sequence from the XML file, given that the structure of XML files can differ from protein to protein?

Thanks

Solution

Good point, database entries in XML do vary between proteins submitted by various authors.

I have made an algorithm to "hunt" for protein sequences from the XML tree:

import os
import sys
from Bio import Entrez
from Bio import SeqIO

gi          = '1293613'         # example gene id                   
Entrez.email= "[email protected]"   # Always tell NCBI who you are
protina     = Entrez.efetch(db="protein", id=gi, retmode="xml") # fetch the xml
protinaXML  = Entrez.read(protina)[0]

seqs = []           # store candidate protein seqs
def seqScan(xml):   # recursively collect protein seqs
    if str(type(xml))=="<class 'Bio.Entrez.Parser.ListElement'>":
        for ele in xml:
            seqScan(ele)
    elif str(type(xml))=="<class 'Bio.Entrez.Parser.DictionaryElement'>":
        for key in xml.keys():
            seqScan(xml[key])
    elif str(type(xml))=="<class 'Bio.Entrez.Parser.StringElement'>":
        #   v___THIS IS THE KEYWORD FILTER_____v
        if (xml.startswith('M') and len(xml))>10: # 1) all proteins start with M (methionine)
            seqs.append(xml)                      # 2) filters out authors starting with M

seqScan(protinaXML) # run the recursive sequence collection
print(seqs)         # print the goods!

Note: in rare cases (depending on the "keyword filter") it may humorously grab unwanted strings such as Authors names starting with 'M' whose abbreviated names are more than 10 characters long (picture below):

enter image description here

Hope that helps!