I have a sequence that has homologues in some species and the score of these homologues.
This is an example record from the gff file:
4592637 Beutenbergia_cavernae_DSM_12333 TILL 70731 70780 . 0 . clst_id=429;SubjectOrganism=Thermofilum_pendens_Hrk_5;SubjectScore=0.343373493975904;SubjectOrganism=Ignicoccus_hospitalis_KIN4_I;SubjectScore=0.323293172690763;SubjectOrganism=Burkholderia_pseudomallei_MSHR346;SubjectScore=0.343373493975904;SubjectOrganism=Burkholderia_mallei_SAVP1;SubjectScore=0.343373493975904;SubjectOrganism=Enterobacter_638;SubjectScore=0.343373493975904;SubjectOrganism=Rickettsia_felis_URRWXCal2;SubjectScore=0.343373493975904;SubjectOrganism=Gemmatimonas_aurantiaca_T_27;SubjectScore=0.343373493975904;SubjectOrganism=Streptomyces_coelicolor;SubjectScore=0.363453815261044;SubjectOrganism=Beutenbergia_cavernae_DSM_12333;SubjectScore=1;SubjectOrganism=Kocuria_rhizophila_DC2201;SubjectScore=0.343373493975904;SubjectOrganism=Rhodococcus_jostii_RHA1;SubjectScore=0.383534136546185;SubjectOrganism=Symbiobacterium_thermophilum_IAM14863;SubjectScore=0.363453815261044;
==>4592637 => NAPP(Nucleic Acid Phylogenetic Profiling database) id of sequence (not genbank id)
==>Beutenbergia_cavernae_DSM_12333 => specie name of sequence
==>TILL => type of sequence
==>70731 .. 70780 => start and end of sequence
==>clst_id=429 => is the id of cluster of this sequence
==>SubjectOrganism => name of specie that sequence has homologues with it
==>SubjectScore => score of homologues of sequence with this specie ( Blastn score )
I want to extract the sequence from the SubjectOrganism
where the sequence(4592637) has similarities.
How can I extract the sequence from a genome where a sequence has homologues using Python?
From the other question, I suppose you already have figured this out. If this is the case, StackOverflow encourages you to answer your own questions, post and accept them! Anyway:
Firstly you fetch your query sequences, replacing the id
with the the id of your organism. I found it querying the NCBI with "Beutenbergia cavernae DSM 12333":
from Bio import Entrez
seq = Entrez.efetch(db="nuccore",
id="229564415",
rettype="fasta",
seq_start=70731,
seq_stop=70780).readlines()
Now seq
contains something like
['>gb|CP001618.1|:70731-70780 Beutenbergia cavernae DSM 12333,'
'complete genome\n',
'GCCCGAGTTCCCCGAACCGTGCCGAGGTAGTACTCCACGGGCGAGGGAGT\n',
'\n']
Use this sequence to launch the qblast, as shown in the other question, but replacing the hardcoded entrez_query
with the strings from the GFF file:
from Bio.Blast.NCBIWWW import qblast
results = qblast("blastn",
"nr",
"".join(seq),
entrez_query='Thermofilum_pendens_Hrk_5')
Be careful, as with thousands of queries the NCBI are surely banning you from the queues.