Search code examples
pythonbioinformaticsbiopythonblast

Removing BLAST hits with multiple HSPs


I have a large BLAST (outfmt 6) output file. I am interested in finding reciprocal homologs within this file but I want to exclude hits with multiple HSPs, eg.

Seq1    Seq2    (alignment 1: evalue bitscore etc)
Seq1    Seq2    (alignment 2: evalue bitscore etc)
Seq3    Seq4    (alignment 1: evalue bitscore etc)
Seq4    Seq5    (alignment 1: evalue bitscore etc)
Seq2    Seq1    (alignment 1: evalue bitscore etc)
Seq2    Seq1    (alignment 2: evalue bitscore etc)
Seq4    Seq3    (alignment 1: evalue bitscore etc)

In this instance only the alignments between sequences 3 and 4 would be returned as the alignments between 1 and 2 share multiple HSPs and the alignments between 4 and 5 only have a one-directional hit. I am hoping to do this in python so I can plug it in with the rest of my program.

Can anybody advise on any threads (etc) that might lead me in the right direction?

Thanks!


Solution

  • Python would be just fine for this.

    from collections import defaultdict
    
    hsp_count = defaultdict(int)
    for line in open("file"):
        seq1, seq2, _ = line.split(maxsplit=2)
        hsp_count[seq1, seq2] += 1
    
    already_checked = set()
    for (seq1, seq2), val1 in hsp_count.items():
        if (seq2, seq1) in already_checked:
            continue
        val2 = hsp_count.get((seq2, seq1))
        if not val2:
            continue
        already_checked.add((seq1, seq2))
        if val1 == 1 and val2 == 1:
            print(seq1, seq2)