Search code examples
pythonextractsequencebiopythongenome

How do I extract a sequence from a genome based on homologues with a sequence?


I have a sequence that has homologues in some species and the score of these homologues.

This is an example record from the gff file:

4592637 Beutenbergia_cavernae_DSM_12333 TILL    70731   70780   .   0   .   clst_id=429;SubjectOrganism=Thermofilum_pendens_Hrk_5;SubjectScore=0.343373493975904;SubjectOrganism=Ignicoccus_hospitalis_KIN4_I;SubjectScore=0.323293172690763;SubjectOrganism=Burkholderia_pseudomallei_MSHR346;SubjectScore=0.343373493975904;SubjectOrganism=Burkholderia_mallei_SAVP1;SubjectScore=0.343373493975904;SubjectOrganism=Enterobacter_638;SubjectScore=0.343373493975904;SubjectOrganism=Rickettsia_felis_URRWXCal2;SubjectScore=0.343373493975904;SubjectOrganism=Gemmatimonas_aurantiaca_T_27;SubjectScore=0.343373493975904;SubjectOrganism=Streptomyces_coelicolor;SubjectScore=0.363453815261044;SubjectOrganism=Beutenbergia_cavernae_DSM_12333;SubjectScore=1;SubjectOrganism=Kocuria_rhizophila_DC2201;SubjectScore=0.343373493975904;SubjectOrganism=Rhodococcus_jostii_RHA1;SubjectScore=0.383534136546185;SubjectOrganism=Symbiobacterium_thermophilum_IAM14863;SubjectScore=0.363453815261044;

==>4592637 => NAPP(Nucleic Acid Phylogenetic Profiling database) id of sequence (not genbank id)

==>Beutenbergia_cavernae_DSM_12333 => specie name of sequence

==>TILL => type of sequence

==>70731 .. 70780 => start and end of sequence

==>clst_id=429 => is the id of cluster of this sequence

==>SubjectOrganism => name of specie that sequence has homologues with it

==>SubjectScore => score of homologues of sequence with this specie ( Blastn score )

I want to extract the sequence from the SubjectOrganism where the sequence(4592637) has similarities.

How can I extract the sequence from a genome where a sequence has homologues using Python?


Solution

  • From the other question, I suppose you already have figured this out. If this is the case, StackOverflow encourages you to answer your own questions, post and accept them! Anyway:

    Firstly you fetch your query sequences, replacing the id with the the id of your organism. I found it querying the NCBI with "Beutenbergia cavernae DSM 12333":

    from Bio import Entrez
    seq = Entrez.efetch(db="nuccore",
                        id="229564415",
                        rettype="fasta",
                        seq_start=70731,
                        seq_stop=70780).readlines()
    

    Now seq contains something like

    ['>gb|CP001618.1|:70731-70780 Beutenbergia cavernae DSM 12333,'
     'complete genome\n',
     'GCCCGAGTTCCCCGAACCGTGCCGAGGTAGTACTCCACGGGCGAGGGAGT\n',
     '\n']
    

    Use this sequence to launch the qblast, as shown in the other question, but replacing the hardcoded entrez_query with the strings from the GFF file:

    from Bio.Blast.NCBIWWW import qblast
    results = qblast("blastn",
                     "nr",
                     "".join(seq),
                     entrez_query='Thermofilum_pendens_Hrk_5')
    

    Be careful, as with thousands of queries the NCBI are surely banning you from the queues.