Search code examples
pythonbioinformaticsbiopythonncbi

Retrieve DNA sequence using a gene identifier of a protein


Im using Biopython to try to retrieve the DNA sequence corresponding to protein of which I have a GI(71743840), from the NCBI page this is very easy, I just need to look for the refseq. My problem comes when coding it in python, using ncbi fetch utilities, I can't find a way to retrieve any field that would help me to go to DNA.

handle = Entrez.efetch(db="nucleotide", id=blast_record.alignments[0].hit_id, rettype="gb", retmode="text")
seq_record=SeqIO.read(handle,"gb")

There is a lot of information in seq_record.features, but there must be an easier and obvious way to do this, any help would be appreciated. Thnx!


Solution

  • You can try to access the annotations of the SeqRecord:

    seq_record=SeqIO.read(handle,"gb")
    nucleotide_accession = seq_record.annotations["db_source"]
    

    In your case nucleotide_accession is "REFSEQ: accession NM_000673.4"

    Now look if you can parse those annotations. With only this test case:

    nucl_id = nucleotide_accession.split()[-1]
    
    handle = Entrez.efetch(db="nucleotide",
                           id=nucl_id,
                           rettype="gb",
                           retmode="text")
    seq_record = SeqIO.read(handle, "gb")