
Querying NCBI for a sequence via Biopython


How can I use Biopython to query NCBI for sequences, given a chromosome's GenBank identifier and start and stop positions?

CP001665    NAPP    TILE    6373    6422    .   +   .   cluster=9; 
CP001665    NAPP    TILE    6398    6447    .   +   .   cluster=3; 
CP001665    NAPP    TILE    6423    6472    .   +   .   cluster=3; 
CP001665    NAPP    TILE    6448    6497    .   +   .   cluster=3;
CP001665    NAPP    TILE    7036    7085    .   +   .   cluster=10; 
CP001665    NAPP    TILE    7061    7110    .   +   .   cluster=3; 
CP001665    NAPP    TILE    7073    7122    .   +   .   cluster=3;

Solution

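  • from Bio import Entrez
    from Bio import SeqIO
    
    # NCBI asks for a contact email with every Entrez request.
    Entrez.email = "sample@example.org"
    
    # Download the full GenBank record for the accession as plain text.
    handle = Entrez.efetch(db="nuccore",
                           id="CP001665",
                           rettype="gb",
                           retmode="text")
    
    whole_sequence = SeqIO.read(handle, "genbank")
    handle.close()
    
    # Slices are 0-based and end-exclusive, so this returns 49 bases.
    print(whole_sequence[6373:6422])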
    

    Once you know the id and the database to fetch from, use Entrez.efetch to get a handle to the record. Specify the return type (rettype="gb") and the mode (retmode="text") to get a handle to the file-like data.

    Then pass this handle to SeqIO.read, which returns a SeqRecord object. One nice feature of SeqRecords is that they can be sliced like lists. Given the start and end positions, the print statement above returns:

    ID: CP001665.1
    Name: CP001665
    Description: Escherichia coli 'BL21-Gold(DE3)pLysS AG', complete genome.
    Number of features: 0
    Seq('GCGCTAACCATGCGAGCGTGCCTGATGCGCTACGCTTATCAGGCCTACG', IUPACAmbiguousDNA())
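
The start and end positions can be read straight from the tile lines in the question. The sketch below assumes those lines are GFF-style, with 1-based, inclusive coordinates in the fourth and fifth columns. That is an assumption, but it fits the data: each tile then spans 50 bases, whereas the slice [6373:6422] yields 49, as the printed Seq shows. The helper names parse_tile_line, to_slice and fetch_region are hypothetical and not part of Biopython. fetch_region is an untested sketch that relies on the seq_start and seq_stop parameters of NCBI's EFetch, which Entrez.efetch passes through as keyword arguments, so that only the requested bases are downloaded instead of the whole genome.

```python
def parse_tile_line(line):
    """Split one whitespace-separated tile line into (accession, start, end, strand)."""
    fields = line.split()
    return fields[0], int(fields[3]), int(fields[4]), fields[6]


def to_slice(start, end):
    """Convert a 1-based, inclusive interval to a 0-based, end-exclusive slice."""
    return slice(start - 1, end)


def fetch_region(accession, start, end, email):
    """Fetch only bases start..end (assumed 1-based, inclusive) as a SeqRecord."""
    # Imported here so the two helpers above work without Biopython installed.
    from Bio import Entrez, SeqIO

    Entrez.email = email
    handle = Entrez.efetch(db="nuccore", id=accession, rettype="fasta",
                           retmode="text", seq_start=start, seq_stop=end)
    try:
        return SeqIO.read(handle, "fasta")
    finally:
        handle.close()


line = "CP001665    NAPP    TILE    6373    6422    .   +   .   cluster=9;"
accession, start, end, strand = parse_tile_line(line)
region = to_slice(start, end)

# Accession, slice bounds, and the number of bases the slice covers.
print(accession, region.start, region.stop, region.stop - region.start)
```

With a record already in memory, whole_sequence[to_slice(start, end)] would select the full tile. For many tiles on the same chromosome, downloading the record once and slicing it repeatedly is probably kinder to NCBI's servers than one request per tile. Tiles on the "-" strand would additionally need reverse-complementing, which this sketch does not handle.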