Tags: workflow, bioinformatics, snakemake

Snakemake processing large workflow slow due to lengthy sequential checking of job completion? >100x speed reduction


I am working on a rather complex snakemake workflow that spawns several hundred thousand jobs. Everything works: the workflow executes and the DAG gets created (thanks to the new checkpoint implementation), but it is unbearably slow. I think the bottleneck is in checking off the successfully completed jobs before continuing to the next ones. More annoyingly, it does this sequentially for all the jobs started in a batch, and the next batch is only executed once all checks have succeeded. Each job takes about one or two seconds to execute and runs in parallel (1 core per job), but snakemake then cycles through the job-completion checks one at a time, taking 5 to 10 seconds per job, so the whole process takes minutes per batch. See the part of the log below, which shows the timestamps of consecutive jobs run in the same batch; there is about 11 seconds between timestamps.

Finished job 42227. 5853 of 230419 steps (3%) done [Thu Feb 28 09:41:09 2019]
Finished job 119645. 5854 of 230419 steps (3%) done [Thu Feb 28 09:41:21 2019]
Finished job 161354. 5855 of 230419 steps (3%) done [Thu Feb 28 09:41:32 2019]
Finished job 160224. 5856 of 230419 steps (3%) done [Thu Feb 28 09:41:42 2019]
Finished job 197063. 5857 of 230419 steps (3%) done [Thu Feb 28 09:41:53 2019]
Finished job 200063. 5858 of 230419 steps (3%) done [Thu Feb 28 09:42:04 2019]
Finished job 45227. 5859 of 230419 steps (3%) done [Thu Feb 28 09:42:15 2019]
Finished job 44097. 5860 of 230419 steps (3%) done [Thu Feb 28 09:42:26 2019]
Finished job 5387. 5861 of 230419 steps (3%) done [Thu Feb 28 09:42:37 2019]
Finished job 158354. 5862 of 230419 steps (3%) done [Thu Feb 28 09:42:48 2019]

So for 20 parallel processes, about 2 seconds are spent on computation, but the run then sits idle for 20 × 11 = 220 seconds before the next 20 jobs are processed.

In the run above I had about 200K+ jobs. If I scale the workflow down, the time between log entries becomes much shorter:

Finished job 2630. 5 of 25857 steps (0.02%) done [Thu Feb 28 10:14:31 2019]
Finished job 886. 6 of 25857 steps (0.02%) done [Thu Feb 28 10:14:31 2019]
Finished job 9606. 7 of 25857 steps (0.03%) done [Thu Feb 28 10:14:31 2019]
Finished job 4374. 8 of 25857 steps (0.03%) done [Thu Feb 28 10:14:32 2019]
Finished job 6118. 9 of 25857 steps (0.03%) done [Thu Feb 28 10:14:32 2019]
Finished job 7862. 10 of 25857 steps (0.04%) done [Thu Feb 28 10:14:32 2019]
Finished job 278. 11 of 25857 steps (0.04%) done [Thu Feb 28 10:14:32 2019]
Finished job 2022. 12 of 25857 steps (0.05%) done [Thu Feb 28 10:14:33 2019]

With 25K jobs the checking time is now < 1 second.

I hope I am missing an argument or some parameter here, but I fear it may be something non-trivial.

I run snakemake with the following flags:

snakemake --keep-going --snakefile My.Snakefile --configfile config.yaml -j 23 --max-jobs-per-second 23

I am running my workflow locally on a 24-core system with 256 GB of RAM.

Any help to speed things up would be greatly appreciated!

John


Solution

  • I've 'solved' my issue for now by replacing the "one-to-many-to-one" parts of my workflow with calls to GNU parallel.

    Pros:

    • No file checking, which removes the snakemake overhead
    • No need for checkpoints
    • Simplified DAG

    Cons:

    • No file checking, so failed jobs are harder to track down, and entire computationally heavy sections of the workflow have to be redone, not just the files that failed.
    • Need to dig into other logs to find out exactly what went wrong.

    I suggest using parallel's --halt option with 'now,fail=1' and its --joblog option to catch problematic files; a rough sketch of this approach follows.
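
    As a minimal sketch of what such a replacement could look like (process_chunk.sh, the chunks/ input directory and the results/ output directory are placeholders for illustration, not the actual names from my workflow):

        # Run one short job per chunk file, 23 at a time.
        # --halt now,fail=1 : abort as soon as the first job fails
        # --joblog          : record runtime, exit status and command for every job
        find chunks/ -name '*.fa' | \
          parallel --jobs 23 --halt now,fail=1 --joblog parallel.joblog \
            './process_chunk.sh {} results/{/.}.out'

    Problematic files can then be pulled out of the job log by filtering on the exit-status column (the seventh column, Exitval, in the default joblog format), e.g. with awk 'NR > 1 && $7 != 0' parallel.joblog.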

    On a test set I've reduced my calculation time from 4 hours to 15 minutes.

    I still feel that snakemake should be able to handle this somehow, but I need to move on with my project.