Tags: python, if-statement, parallel-processing, biopython, joblib

Joblib too slow using "if not in" loop


I am working with amino acid sequences using the Biopython parser. The data format is FASTA, so you can imagine each record as a string of letters preceded by its id, as in the example below. My problem is that I have a huge amount of data, and despite trying to parallelize with joblib, the estimated running time for this simple code is 400 hours.

Basically, I have a file containing a series of ids that I have to remove (ids_to_drop) from the original dataset (original_dataset), in order to create a new file (new_dataset) that contains every id in the original dataset except the ids_to_drop.

I've tried everything I can think of, but I don't know how else to do it and I'm stuck right now. Thanks so much!

import tqdm
from Bio import SeqIO
from joblib import Parallel, delayed

def file_without_ids_to_remove(seq):
    # new_output: filtered output file; ids_to_drop: file of ids to remove
    with open(new_output, "a") as f, open(ids_to_drop, "r") as r:
        remove = r.read().split("\n")  # re-read on every call
        if seq.id not in remove:
            SeqIO.write(seq, f, "fasta")

Parallel(n_jobs=10)(delayed(file_without_ids_to_remove)(seq)
                    for seq in tqdm.tqdm(SeqIO.parse(original_dataset, "fasta")))

To be clear, this is an example of the data (sequence.id + sequence):

WP_051064487.1 MSSAAQTPEATSDVSDANAKQAEALRVASVNVNGIRASYRKGMAEWLAPRQVDILCLQEVRAPDEVVDGF LADDWHIVHAEAEAKGRAGVLIASRKDSLAPDATRIGIGEEYFATAGRWVEADYTIGENAKKLTVISAYV HSGEVGTQRQEDKYRFLDTMLERMAELAEQSDYALIVGDLNVGHTELDIKNWKGNVKNAGFLPEERAYFD KFFGGGDTPGGLGWKDVQRELAGPVNGPYTWWSQRGQAFDNDTGWRIDYHMATPELFARAGNAVVDRAPS YAERWSDHAPLLVDYTIR
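
For reference, if the headers carry the standard FASTA ">" prefix, Biopython's parser exposes exactly the id being compared. A minimal sketch with a hypothetical one-record file:

    from io import StringIO
    from Bio import SeqIO

    # a tiny, made-up FASTA record modeled on the example above
    fasta = ">WP_051064487.1 example protein\nMSSAAQTPEATSDVSDANAKQAEALRVASVNVNGIRASYRKG\n"
    record = next(SeqIO.parse(StringIO(fasta), "fasta"))
    print(record.id)  # WP_051064487.1 -- this is what seq.id compares against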

UPDATE: I tried the following after the suggestion, and it works.

with open(new_dataset, "w") as filtered:
    for seq in tqdm.tqdm(SeqIO.parse(original_dataset, "fasta")):
        if seq.id not in ids_to_remove:
            SeqIO.write(seq, filtered, "fasta")
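
For context on why this is so much faster: the original function re-read and re-split the ids file for every sequence, while the working version performs a single constant-time set lookup per record (assuming ids_to_remove is built as a set, as in the solution below). A minimal sketch, with made-up ids, of the list-versus-set lookup cost:

    import timeit

    ids_list = [f"WP_{i:09d}.1" for i in range(100_000)]  # hypothetical ids
    ids_set = set(ids_list)
    probe = ids_list[-1]  # worst case for the list: last element

    print(timeit.timeit(lambda: probe in ids_list, number=100))  # linear scan per test
    print(timeit.timeit(lambda: probe in ids_set, number=100))   # one hash lookup per test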

   

Solution

  • This looks like a simple file-filter operation. Build the set of ids to remove once, then just read, filter, and write the original dataset in a single pass. Sets are optimized for fast membership lookup, and the operation is I/O-bound, so it would not benefit from parallelization.

    with open("ids-to-remove") as f:
        ids_to_remove = {seq_id_line.strip() for seq_id_line in f}
    # just in case there are blank lines
    if "" in ids_to_remove:
        ids_to_remove.remove("")
    with open("original-data-set") as orig, open("filtered-data-set", "w") as filtered:
        filtered.writelines(line for line in orig if line.split()[0] not in ids_to_remove)
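
One caveat, offered as a hedged alternative rather than a correction: the line-by-line filter above assumes each record occupies a single line whose first token is the id, as in the example data. If the dataset is standard multi-line FASTA (a ">" header followed by wrapped sequence lines), dropping only the header line would leave its sequence lines behind. In that case a record-level filter with Biopython, in the spirit of the asker's update, is the safer sketch (file names are the same placeholders used above):

    from Bio import SeqIO

    with open("ids-to-remove") as f:
        ids_to_remove = {line.strip() for line in f}
    ids_to_remove.discard("")

    # SeqIO.parse yields whole records, so dropping an id removes
    # its header and all of its sequence lines together
    with open("filtered-data-set", "w") as filtered:
        for seq in SeqIO.parse("original-data-set", "fasta"):
            if seq.id not in ids_to_remove:
                SeqIO.write(seq, filtered, "fasta")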