Search code examples
xmlsequencebioinformaticsbiopythonncbi

how to get a specific protein sequence using entrez.efetch?


I am trying to get the protein sequence from NCBI via a gene id (GI) number, using Biopython's Entrez.fetch() function.

proteina = Entrez.efetch(db="protein", id= gi, rettype="gb", retmode="xml").

I then read the data using:

proteinaXML = Entrez.read(proteina).

I can print the results, however I don't know how to get the protein sequence alone.

I can reach the protein manually once the result is displayed. Or I I check the XML tree using:

proteinaXML[0]["GBSeq_feature-table"][2]["GBFeature_quals"][6]['GBQualifier_value'].

However, depending on the GI of the protein submitted, the XML tree can differ. Making it difficult to automate this process robustly.

My question: Is it possible to retrieve only the protein sequence, and not the entire XML tree? Or alternatively: How can I extract the protein sequence from the XML file, given that the structure of XML files can differ from protein to protein?

Thanks


Solution

  • Good point, database entries in XML do vary between proteins submitted by various authors.

    I have made an algorithm to "hunt" for protein sequences from the XML tree:

    import os
    import sys
    from Bio import Entrez
    from Bio import SeqIO
    
    gi          = '1293613'         # example gene id                   
    Entrez.email= "[email protected]"   # Always tell NCBI who you are
    protina     = Entrez.efetch(db="protein", id=gi, retmode="xml") # fetch the xml
    protinaXML  = Entrez.read(protina)[0]
    
    seqs = []           # store candidate protein seqs
    def seqScan(xml):   # recursively collect protein seqs
        if str(type(xml))=="<class 'Bio.Entrez.Parser.ListElement'>":
            for ele in xml:
                seqScan(ele)
        elif str(type(xml))=="<class 'Bio.Entrez.Parser.DictionaryElement'>":
            for key in xml.keys():
                seqScan(xml[key])
        elif str(type(xml))=="<class 'Bio.Entrez.Parser.StringElement'>":
            #   v___THIS IS THE KEYWORD FILTER_____v
            if (xml.startswith('M') and len(xml))>10: # 1) all proteins start with M (methionine)
                seqs.append(xml)                      # 2) filters out authors starting with M
    
    seqScan(protinaXML) # run the recursive sequence collection
    print(seqs)         # print the goods!
    

    Note: in rare cases (depending on the "keyword filter") it may humorously grab unwanted strings such as Authors names starting with 'M' whose abbreviated names are more than 10 characters long (picture below):

    enter image description here

    Hope that helps!