python extract sequence biopython genome

How do I extract a sequence from a genome based on homologues with a sequence?

I have a sequence that has homologues in some species and the score of these homologues.

This is an example record from the gff file:

4592637 Beutenbergia_cavernae_DSM_12333 TILL    70731   70780   .   0   .   clst_id=429;SubjectOrganism=Thermofilum_pendens_Hrk_5;SubjectScore=0.343373493975904;SubjectOrganism=Ignicoccus_hospitalis_KIN4_I;SubjectScore=0.323293172690763;SubjectOrganism=Burkholderia_pseudomallei_MSHR346;SubjectScore=0.343373493975904;SubjectOrganism=Burkholderia_mallei_SAVP1;SubjectScore=0.343373493975904;SubjectOrganism=Enterobacter_638;SubjectScore=0.343373493975904;SubjectOrganism=Rickettsia_felis_URRWXCal2;SubjectScore=0.343373493975904;SubjectOrganism=Gemmatimonas_aurantiaca_T_27;SubjectScore=0.343373493975904;SubjectOrganism=Streptomyces_coelicolor;SubjectScore=0.363453815261044;SubjectOrganism=Beutenbergia_cavernae_DSM_12333;SubjectScore=1;SubjectOrganism=Kocuria_rhizophila_DC2201;SubjectScore=0.343373493975904;SubjectOrganism=Rhodococcus_jostii_RHA1;SubjectScore=0.383534136546185;SubjectOrganism=Symbiobacterium_thermophilum_IAM14863;SubjectScore=0.363453815261044;

==>4592637 => NAPP(Nucleic Acid Phylogenetic Profiling database) id of sequence (not genbank id)

==>Beutenbergia_cavernae_DSM_12333 => specie name of sequence

==>TILL => type of sequence

==>70731 .. 70780 => start and end of sequence

==>clst_id=429 => is the id of cluster of this sequence

==>SubjectOrganism => name of specie that sequence has homologues with it

==>SubjectScore => score of homologues of sequence with this specie ( Blastn score )

I want to extract the sequence from the SubjectOrganism where the sequence(4592637) has similarities.

How can I extract the sequence from a genome where a sequence has homologues using Python?

Solution

From the other question, I suppose you already have figured this out. If this is the case, StackOverflow encourages you to answer your own questions, post and accept them! Anyway:

Firstly you fetch your query sequences, replacing the id with the the id of your organism. I found it querying the NCBI with "Beutenbergia cavernae DSM 12333":

from Bio import Entrez
seq = Entrez.efetch(db="nuccore",
                    id="229564415",
                    rettype="fasta",
                    seq_start=70731,
                    seq_stop=70780).readlines()

Now seq contains something like

['>gb|CP001618.1|:70731-70780 Beutenbergia cavernae DSM 12333,'
 'complete genome\n',
 'GCCCGAGTTCCCCGAACCGTGCCGAGGTAGTACTCCACGGGCGAGGGAGT\n',
 '\n']

Use this sequence to launch the qblast, as shown in the other question, but replacing the hardcoded entrez_query with the strings from the GFF file:

from Bio.Blast.NCBIWWW import qblast
results = qblast("blastn",
                 "nr",
                 "".join(seq),
                 entrez_query='Thermofilum_pendens_Hrk_5')

Be careful, as with thousands of queries the NCBI are surely banning you from the queues.