How can I query NCBI for sequences given a chromosome's Genbank identifier, and start and stop positions using Biopython?
CP001665 NAPP TILE 6373 6422 . + . cluster=9;
CP001665 NAPP TILE 6398 6447 . + . cluster=3;
CP001665 NAPP TILE 6423 6472 . + . cluster=3;
CP001665 NAPP TILE 6448 6497 . + . cluster=3;
CP001665 NAPP TILE 7036 7085 . + . cluster=10;
CP001665 NAPP TILE 7061 7110 . + . cluster=3;
CP001665 NAPP TILE 7073 7122 . + . cluster=3;
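from Bio import Entrez
from Bio import SeqIO

# NCBI asks for a contact address with every E-utilities request
Entrez.email = "sample@example.org"
handle = Entrez.efetch(db="nuccore",
                       id="CP001665",
                       rettype="gb",
                       retmode="text")
whole_sequence = SeqIO.read(handle, "genbank")
handle.close()
# Python slices are 0-based and exclude the end index
print(whole_sequence[6373:6422])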
Once you know the id and the database to fetch from, use Entrez.efetch to get a handle to that record. Specify the return type (rettype="gb") and the mode (retmode="text") to get a handle to the file-like data.
Then pass this handle to SeqIO.read, which returns a SeqRecord object. One nice feature of SeqRecords is that they can be sliced cleanly, like lists. Keep in mind that Python slices are 0-based and exclude the end index, whereas the coordinates in the question look 1-based and inclusive (as in GFF), so a tile spanning 6373 to 6422 would correspond to whole_sequence[6372:6422]. If you can retrieve the start and end positions from somewhere, the print line above returns:
ID: CP001665.1
Name: CP001665
Description: Escherichia coli 'BL21-Gold(DE3)pLysS AG', complete genome.
Number of features: 0
Seq('GCGCTAACCATGCGAGCGTGCCTGATGCGCTACGCTTATCAGGCCTACG', IUPACAmbiguousDNA())
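To work directly from the tile lines in the question, the coordinates first have to be converted. The sketch below assumes the lines are whitespace-separated and GFF-style (1-based, inclusive start and end, strand in the seventh column). The names Tile, parse_tile, tile_to_slice and efetch_kwargs are hypothetical helpers, not part of Biopython. The seq_start, seq_stop and strand parameters are EFetch options that restrict the download to a sub-region, which may be preferable to fetching a whole genome for each tile.

```python
from typing import NamedTuple


class Tile(NamedTuple):
    seqid: str
    start: int   # 1-based, inclusive (GFF convention, assumed here)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def parse_tile(line: str) -> Tile:
    """Parse one whitespace-separated, GFF-style line like those in the question."""
    fields = line.split()
    return Tile(fields[0], int(fields[3]), int(fields[4]), fields[6])


def tile_to_slice(tile: Tile) -> slice:
    """Convert 1-based inclusive coordinates to a 0-based, half-open Python slice."""
    return slice(tile.start - 1, tile.end)


def efetch_kwargs(tile: Tile) -> dict:
    """Build keyword arguments for Entrez.efetch that request only the tile's region.

    EFetch's seq_start/seq_stop are 1-based and inclusive, so the GFF
    coordinates pass through unchanged; strand is 1 for plus, 2 for minus.
    """
    return {
        "db": "nuccore",
        "id": tile.seqid,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": tile.start,
        "seq_stop": tile.end,
        "strand": 1 if tile.strand == "+" else 2,
    }


tile = parse_tile("CP001665 NAPP TILE 6373 6422 . + . cluster=9;")
region = tile_to_slice(tile)
print(region.stop - region.start)  # 50, the length of the tile in bases
```

With Biopython installed, whole_sequence[tile_to_slice(tile)] should slice an already downloaded record, while Entrez.efetch(**efetch_kwargs(tile)) followed by SeqIO.read(handle, "fasta") should retrieve just that region from NCBI.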