I have a BLAST outfmt 6 output file in standard format, I want to find a way to loop through the file, select each hit, find its reciprocal hit and decipher which is the best hit to store.
For example:
d = {}
for line in input_file:
term = line.split('\t')
qseqid = term[0]
sseqid = term[1]
hit = qseqid, sseqid
recip_hit = sseqid, qseqid
for line in input_file:
if recip_hit in line:
compare both lines
done
Example input (tab delimited):
Seq1 Seq2 80 1000 10 3 1 1000 100 1100 0.0 500
Seq2 Seq1 95 1000 10 3 100 1100 1 1000 1e-100 500
Can anyone provide any insight how to efficiently tackle this problem?
Many thanks in advance
You could approach your problem to find those pairs and compare the lines like this:
#create a dictionary to store pairs
line_dict = {}
#iterate over your file
for line in open("test.txt", "r"):
line = line[:-1].split("\t")
#ignore line, if not at least one value apart from the two sequence IDs
if len(line) < 3:
continue
#identify the two sequences
seq = tuple(line[0:2])
#is reverse sequence already in dictionary?
if seq[::-1] in line_dict:
#append new line
line_dict[seq[::-1]].append(line)
else:
#create new entry
line_dict[seq] = [line]
#remove entries, for which no counterpart exists
pairs = {k: v for k, v in line_dict.items() if len(v) > 1}
#and do things with these pairs
for pair, seq in pairs.items():
print(pair, "found in:")
for item in seq:
print(item)
The advantage is that you only have to iterate once over your file, because you store all data and discard them only, if you haven't found a matching reversed pair. The disadvantage is that this takes space, so for very large files, this approach might not be feasible.
A similar approach - to store all data in your working memory - utilises pandas. This should be faster, since sorting algorithms are optimised for pandas. Another advantage of pandas is that all your other values are already in pandas columns - so further analysis is made easier. I definitely prefer the pandas version, but I don't know, if it is installed on your system. To make things easier to communicate, I assigned a
and b
to the columns that contain the sequences Seq1
and Seq2
.
import pandas as pd
#read data into a dataframe
#not necessary: drop the header of the file, use custom columns names
df = pd.read_csv("test.txt", sep='\t', names=list("abcde"), header = 0)
#create a column that joins Seq1 - Seq2 or Seq2 - Seq1 to Seq1Seq2
df["pairs"] = df.apply(lambda row: ''.join(sorted([row["a"], row["b"]])), axis = 1)
#remove rows with no matching pair and sort the database
only_pairs = df[df["pairs"].duplicated(keep = False)].sort_values(by = "pairs")
print(only_pairs)