Search code examples
solversnakemake

"Select jobs to execute..." runs literally forever


I have a rather complex workflow with 750 samples and roughly 18.000 jobs, at first snakemake runs just fine but then after around 4.000 jobs it suddenly freezes and upon restart it hangs with "Select jobs to execute..." for 24h, after that I terminated it. The initial DAG building takes roughly 2-3 minutes, though.

When I run snakemake (v5.32.0 and v5.32.1) with the --verbose option, I get tons of lines similar to this one:

Cbc0010I After 600 nodes, 304 on tree, -52534.791 best solution, best possible -52538.194 (7.08 seconds

I tried to delete the .snakemake folder in the hope that something went riot there, but that wasn't the case, unfortunately. To me it seems that the CBC MILP Solver somehow does not converge, and it keeps going and going to bring the best and the best possible solution closer together!?

Now I do not have any idea anymore, how to proceed and fix the problem. My possible solutions are somehow to change the convergence criteria or the solver itself. In the manual I found the option --scheduler-ilp-solver but it has apparently only one option, the default COIN_CMD.

After terminating a (shorter) run, I get this verbose output

Result - User ctrl-cuser ctrl-c

Objective value:                52534.79114334
Upper bound:                    52538.202
Gap:                            -0.00
Enumerated nodes:               186926
Total iterations:               1807277
Time (CPU seconds):             1181.97
Time (Wallclock seconds):       1188.11

Next I will try to limit the number of samples in the workflow and see if this has any impact (for other datasets with 500 samples, it ran without any problems (with snakemake version 5.24), but there the DAG building took some hours. Hence, I am not very eager to try the old version.)

So, any idea how to fix the problem is highly appreciated. Also, I do not even know, if this is a bug!?

EDIT Actually, I believe it is a bug in the current version, I downgraded Snakemake back to version 5.24, it created the DAG within 10 minutes and started to run the pipeline. So, apparently there is some bug with the latest version. I will make this an answer to my own question, as the downgrading to an older version solved the problem...


Solution

  • I also ran into this issue with a smaller workflow (~1500 jobs total) and snakemake version 6.0.2. About half the jobs had run when the workflow got stuck, and refused to run any more jobs. Looks like it's a problem specific to the ILP solver, because when I re-ran with --scheduler greedy, it worked fine.