I have written a Snakemake rule that runs Muscle (an MSA tool) to calculate a multiple sequence alignment (MSA) for every file in a directory. The task is trivially parallel, as the files do not depend on each other. The problem is that Snakemake runs this rule in "batches" of n jobs, where n is the number of cores given to Snakemake as an argument:
snakemake -j 4 msa
Snakemake starts by running 4 jobs in parallel and waits until all of them have finished before starting a new "batch" of 4 jobs. This wastes CPU time, as the input files vary a lot in size and their MSA calculation time ranges from seconds to minutes. This results in the following execution flow:
job1|-----           |job5|-----     |...|->
job2|---             |job6|--------  |...|->
job3|----------------|job7|--        |...|->
job4|-               |job8|----------|...|->
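To put a rough number on the wasted time, here is a small sketch that compares batch-wise execution with a scheduler that hands a new job to a core as soon as it becomes free. The runtimes are made up to loosely mirror the diagram above, and both helper functions are hypothetical; this does not model Snakemake's actual scheduler.

```python
import heapq


def batch_makespan(durations, cores):
    """Total time when jobs run in fixed batches of `cores` jobs.

    Each batch lasts as long as its slowest job.
    """
    total = 0
    for start in range(0, len(durations), cores):
        total += max(durations[start:start + cores])
    return total


def greedy_makespan(durations, cores):
    """Total time when a free core immediately picks up the next job."""
    # Min-heap holding the time at which each core becomes free.
    free_at = [0] * cores
    heapq.heapify(free_at)
    for duration in durations:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + duration)
    return max(free_at)


# Hypothetical runtimes in seconds for eight jobs (not measured values).
durations = [5, 3, 16, 1, 5, 8, 2, 10]

print("batched:", batch_makespan(durations, 4))
print("greedy: ", greedy_makespan(durations, 4))
```

With these assumed runtimes, the batched schedule is dominated by the slowest job in each batch, while the greedy schedule is bounded by the single longest job.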
How could I tell Snakemake to truly parallelize the jobs?
CLUSTER_IDS, = glob_wildcards(os.path.join(WORK_DIR, "fasta", "{id}.fasta"))

rule msa:
    input:
        expand(os.path.join(WORK_DIR, "msa", "{id}.afa"), id=CLUSTER_IDS)

rule:
    input:
        os.path.join(WORK_DIR, "fasta", "{id}.fasta")
    output:
        os.path.join(WORK_DIR, "msa", "{id}.afa")
    shell:
        "{MUSCLE_PATH}/muscle3.8.31_i86darwin64 -in {input} -out {output}"
The issue was solved after updating Snakemake to version 6.5.2 from 5.30.1.
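To keep the workflow from being run with an older release by accident, one option is a version guard at the top of the Snakefile. This is a minimal sketch using `min_version` from `snakemake.utils`; 6.5.2 is simply the version that worked here, and which release actually changed the behaviour is not known.

```python
# Top of the Snakefile: stop early if the installed Snakemake is older than
# the version that worked for this workflow (assumed threshold: 6.5.2).
from snakemake.utils import min_version

min_version("6.5.2")
```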