Finding the best reciprocal hit in a single BLAST file using python

I have a BLAST outfmt 6 output file in standard format, I want to find a way to loop through the file, select each hit, find its reciprocal hit and decipher which is the best hit to store.

For example:

d = {}
for line in input_file:
    term = line.split('\t')
    qseqid = term[0]
    sseqid = term[1]
    hit = qseqid, sseqid
    recip_hit = sseqid, qseqid
    for line in input_file:
        if recip_hit in line:
            compare both lines
done

Example input (tab delimited):

Seq1    Seq2    80    1000   10    3   1    1000    100    1100    0.0    500
Seq2    Seq1    95    1000   10    3   100    1100    1    1000    1e-100    500

Can anyone provide any insight how to efficiently tackle this problem?

Many thanks in advance

Solution

You could approach your problem to find those pairs and compare the lines like this:

#create a dictionary to store pairs
line_dict = {}
#iterate over your file
for line in open("test.txt", "r"):
    line = line[:-1].split("\t")
    #ignore line, if not at least one value apart from the two sequence IDs
    if len(line) < 3:
        continue
    #identify the two sequences
    seq = tuple(line[0:2])
    #is reverse sequence already in dictionary?
    if seq[::-1] in line_dict:
        #append new line
        line_dict[seq[::-1]].append(line)
    else:
        #create new entry
        line_dict[seq] = [line]

#remove entries, for which no counterpart exists
pairs = {k: v for k, v in line_dict.items() if len(v) > 1}

#and do things with these pairs
for pair, seq in pairs.items():
    print(pair, "found in:")
    for item in seq:
        print(item)

The advantage is that you only have to iterate once over your file, because you store all data and discard them only, if you haven't found a matching reversed pair. The disadvantage is that this takes space, so for very large files, this approach might not be feasible.

A similar approach - to store all data in your working memory - utilises pandas. This should be faster, since sorting algorithms are optimised for pandas. Another advantage of pandas is that all your other values are already in pandas columns - so further analysis is made easier. I definitely prefer the pandas version, but I don't know, if it is installed on your system. To make things easier to communicate, I assigned a and b to the columns that contain the sequences Seq1 and Seq2.

import pandas as pd
#read data into a dataframe
#not necessary: drop the header of the file, use custom columns names
df = pd.read_csv("test.txt", sep='\t', names=list("abcde"), header = 0)

#create a column that joins Seq1 - Seq2 or Seq2 - Seq1 to Seq1Seq2
df["pairs"] = df.apply(lambda row: ''.join(sorted([row["a"], row["b"]])), axis = 1)
#remove rows with no matching pair and sort the database
only_pairs = df[df["pairs"].duplicated(keep = False)].sort_values(by = "pairs")

print(only_pairs)