Search code examples
pythonbioinformaticsblast

Finding the best reciprocal hit in a single BLAST file using python


I have a BLAST outfmt 6 output file in standard format, I want to find a way to loop through the file, select each hit, find its reciprocal hit and decipher which is the best hit to store.

For example:

d = {}
for line in input_file:
    term = line.split('\t')
    qseqid = term[0]
    sseqid = term[1]
    hit = qseqid, sseqid
    recip_hit = sseqid, qseqid
    for line in input_file:
        if recip_hit in line:
            compare both lines
done

Example input (tab delimited):

Seq1    Seq2    80    1000   10    3   1    1000    100    1100    0.0    500
Seq2    Seq1    95    1000   10    3   100    1100    1    1000    1e-100    500

Can anyone provide any insight how to efficiently tackle this problem?

Many thanks in advance


Solution

  • You could approach your problem to find those pairs and compare the lines like this:

    #create a dictionary to store pairs
    line_dict = {}
    #iterate over your file
    for line in open("test.txt", "r"):
        line = line[:-1].split("\t")
        #ignore line, if not at least one value apart from the two sequence IDs
        if len(line) < 3:
            continue
        #identify the two sequences
        seq = tuple(line[0:2])
        #is reverse sequence already in dictionary?
        if seq[::-1] in line_dict:
            #append new line
            line_dict[seq[::-1]].append(line)
        else:
            #create new entry
            line_dict[seq] = [line]
    
    #remove entries, for which no counterpart exists
    pairs = {k: v for k, v in line_dict.items() if len(v) > 1}
    
    #and do things with these pairs
    for pair, seq in pairs.items():
        print(pair, "found in:")
        for item in seq:
            print(item)
    

    The advantage is that you only have to iterate once over your file, because you store all data and discard them only, if you haven't found a matching reversed pair. The disadvantage is that this takes space, so for very large files, this approach might not be feasible.

    A similar approach - to store all data in your working memory - utilises pandas. This should be faster, since sorting algorithms are optimised for pandas. Another advantage of pandas is that all your other values are already in pandas columns - so further analysis is made easier. I definitely prefer the pandas version, but I don't know, if it is installed on your system. To make things easier to communicate, I assigned a and b to the columns that contain the sequences Seq1 and Seq2.

    import pandas as pd
    #read data into a dataframe
    #not necessary: drop the header of the file, use custom columns names
    df = pd.read_csv("test.txt", sep='\t', names=list("abcde"), header = 0)
    
    #create a column that joins Seq1 - Seq2 or Seq2 - Seq1 to Seq1Seq2
    df["pairs"] = df.apply(lambda row: ''.join(sorted([row["a"], row["b"]])), axis = 1)
    #remove rows with no matching pair and sort the database
    only_pairs = df[df["pairs"].duplicated(keep = False)].sort_values(by = "pairs")
    
    print(only_pairs)