Search code examples
biopythonblastncbi

Finding similar DNA sequence in a specific organism with Biopython's Blast module


I am trying to find a DNA sequence homologous to a given coding sequence in a specific organism (E. fergusoni, to root an E. coli genes phylogeny) I want to blast one of my focal species' sequence and get back all sequence from E. fergusoni that have between 80% and 98% similarity on the whole segment. This work prety well on the NCBI online version of blast, by setting "Escherichia fergusonii (taxid:564)" in the "Organism" field. However, I want to automatise this process to be able to run it quickly on a large set of disinct E. coli genes.

I want to use the NCBIWWW.qblast function from the Blast module with argument "entrez_query" to specify the organism I want to restrain my research to.

This is what I tried :


Blast.email = [email protected]
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

sequence_data = open("gene0.fasta").read()
result_handle = NCBIWWW.qblast(
    program="blastn",
    database="nt",
    sequence=sequence_data,
    # entrez_query = "txid564[Organism]",
    # entrez_query = "Escherichia fergusonii (taxid:564)[Organism]",
    # entrez_query = "txid564[ORGN]",
    # entrez_query = "564[Taxid]",
    format_type="XML",
)

with open("gene0_results.xml", "w") as out_handle:
    out_handle.write(result_handle.read())



for record in NCBIXML.parse(open("gene0_results.xml")): 
    if record.alignments: 
        print("\n") 
        print("query: %s" % record.query[:100]) 
        for align in record.alignments: 
            for hsp in align.hsps: 
                if (hsp.identities/len(hsp.query) < 0.98) and (hsp.identities/len(hsp.query) >= 0.8)  : 
                    print("match: %s " % align.title[:100])
                    print(f"length: {align.length}")
                    print(f"e value: {hsp.expect}")
                    print(f"identity : {hsp.identities/len(hsp.query)} ({hsp.identities} nucleotides over {len(hsp.query)})")
                    print(hsp.query[0:75] + "...")
                    print(hsp.match[0:75] + "...")
                    print(hsp.sbjct[0:75] + "...")

When I run this script without uncommenting the entrez_query arguments, I get the E. coli sequences that are 100% similar to my query, in about 1 mn. But when I uncomment one of them, I don't get any result at all (I also checked the XML file to see if my later filtering was not the reason), and the request can take up to 20mn. I am looking for the correct syntax to specify the organism with entrez_query (I tried different ones based on what I saw on other posts or what seemed intuitive) but none of them give any result, contrarily to the internet blast version.

I am using Python 3.11 and Biopython 1.83 in a Jupyter Notebook with VSC 1.89.1


Solution

  • This worked for me and matched only those sequences with Escherichia fergusoni in the description line:

    from Bio.Blast import NCBIWWW
    from Bio.Blast import NCBIXML
    NCBIWWW.email = "[email protected]" # based on https://biopython.org/DIST/docs/tutorial/Tutorial.html#sec178
    
    sequence_data = open("gene0.fasta").read()
    result_handle = NCBIWWW.qblast(
        program="blastn",
        database="nt",
        sequence=sequence_data,
        # entrez_query = "txid564[Organism]",
        # entrez_query = "Escherichia fergusonii (taxid:564)[Organism]",
        entrez_query = "txid564[ORGN]", # best match to https://www.biostars.org/p/439571/#439707
        # entrez_query = "564[Taxid]",
        format_type="XML",
    )
    
    with open("gene0_results.xml", "w") as out_handle:
        out_handle.write(result_handle.read())
    
    
    
    for record in NCBIXML.parse(open("gene0_results.xml")): 
        if record.alignments: 
            print("\n") 
            print("query: %s" % record.query[:100]) 
            for align in record.alignments: 
                for hsp in align.hsps: 
                    if (hsp.identities/len(hsp.query) < 0.98) and (hsp.identities/len(hsp.query) >= 0.8)  : 
                        print("match: %s " % align.title[:100])
                        print(f"length: {align.length}")
                        print(f"e value: {hsp.expect}")
                        print(f"identity : {hsp.identities/len(hsp.query)} ({hsp.identities} nucleotides over {len(hsp.query)})")
                        print(hsp.query[0:75] + "...")
                        print(hsp.match[0:75] + "...")
                        print(hsp.sbjct[0:75] + "...")
    

    I see results listed with matches to the DNA I arbitrarily selected. It doesn't take 20 minutes; however, it isn't instant either.

    You need an email or you get throttled. Your code didn't seen to handle that appropriately? Maybe that was where you were encountering issues? Update what I posted above with your email.
    For the taxon identifier I followed the example from this Biostar's post and uncommented your line entrez_query = "txid564[ORGN]".
    Other than that, I used your code.

    If you want to make it run faster and only want to query one (or a few ) of the Escherichia fergusoni genomes, see the 'Available Notebooks' listed if you go to my blast-binder repo and click on the 'launch binder' badge to get a temporary session in your browser with things already to go. You can essentially just make a database of a genome and search that. It is more efficient because you are essentially running BLAST locally. The later notebooks show integrating with Python better as the earlier ones are just to build up showing it works on the command line.