I have a Snakemake workflow with a single Snakefile and a single config file. In the Snakefile I specify a job; jobs are numbered non-sequentially (e.g. 210, 215). For each job I can specify, the config file has a corresponding entry holding that job's information (parameters like year, number of subjobs, a prefix for files, etc., all stored as strings). In rules, I construct inputs and outputs with statements like config[job]["year"] to provide the correct strings for each job.
A simplified example of my workflow to hopefully demonstrate how I use the information from the config file:
```python
# SNAKEFILE
job = 210

rule all:
    input:
        expand(config["outputdir"] + "/" + config[job]["prefix"] + "_test_"
               + config[job]["year"] + config[job]["originID"] + "_{sample}.root",
               sample=config[job]["samples"])
```
...other rules...
```python
rule filter_2:
    input:
        config["outputdir"] + "/filter-1-applied/{sj}/" + config[job]["prefix"] + "_test_" + config[job]["year"] + config[job]["originID"] + "_{sample}.root"
    output:
        config["outputdir"] + "/filter-2-applied/{sj}/" + config[job]["prefix"] + "_test_" + config[job]["year"] + config[job]["originID"] + "_{sample}.root"
    log:
        "logs/filter-2/{sj}/{sample}.log"  # required for the "2> {log}" redirect below
    shell:
        "(bash scripts/filter-2.sh {input} {output}) 2> {log}"
```
...other rules...
```yaml
# CONFIG.YAML
outputdir: "/home/ghl/outputs"

210:
  prefix: "Real"
  year: "2016"
  origindir: "/home/ghl/files/210"
  subjobs: 2653
  originID: "_abc123"
  samples: ["type1_v1", "type1_v2", "type2_v1", "type2_v2"]
```
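To make the lookup concrete: the path construction in the rules above is plain Python string concatenation over a nested dict. A standalone sketch, with config values copied from the example above and no Snakemake required:

```python
# Simulate the config dict that Snakemake builds from CONFIG.YAML.
# Note that the unquoted YAML key 210 becomes an int key.
config = {
    "outputdir": "/home/ghl/outputs",
    210: {
        "prefix": "Real",
        "year": "2016",
        "originID": "_abc123",
        "samples": ["type1_v1", "type1_v2", "type2_v1", "type2_v2"],
    },
}

job = 210  # the hard-coded job selector from the Snakefile

# The same expression used in 'rule all', expanded over the samples list:
paths = [
    config["outputdir"] + "/" + config[job]["prefix"] + "_test_"
    + config[job]["year"] + config[job]["originID"] + "_" + sample + ".root"
    for sample in config[job]["samples"]
]
print(paths[0])  # /home/ghl/outputs/Real_test_2016_abc123_type1_v1.root
```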
This was fine when I had a small number of jobs, but now that I have ~80 to run over, some taking several hours even on the batch submission system I have access to, it takes forever to manually run each one, wait, change the 'job' variable, and run again. What I would like is to run multiple jobs (e.g. 210 and 215) from a single run of this Snakefile.
In Python I would just enclose this all in a loop like:

```
for job in [1, 3, ..., 210, 215]:
    <run single-job workflow>
print("Done!")
```
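(For completeness, one stopgap in exactly this spirit is a wrapper script that invokes Snakemake once per job, using the `--config` command-line flag to override a config key; the Snakefile would then read the job via `job = config["job"]` instead of hard-coding it. A hedged sketch, where the job list and that Snakefile change are assumptions:)

```python
import subprocess

def snakemake_cmd(job):
    # --config overrides keys in the config dict for this run; the Snakefile
    # would pick the job up via job = config["job"] rather than hard-coding it.
    return ["snakemake", "--config", f"job={job}"]

if __name__ == "__main__":
    for job in [210, 215]:  # hypothetical job list
        subprocess.run(snakemake_cmd(job), check=True)
    print("Done!")
```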
I'm trying to do the same in my Snakefile. I've tried putting job=jobs in the input of 'rule all' (as I do for samples) after defining jobs=[210,215], and I've also tried changing the input to a function that returns the corresponding filenames for a list of jobs. Both run into the same issue: 'job' is no longer a Python variable in the script but a wildcard, and it's unclear to me how to pass a wildcard to something like config[job]["year"]: neither config[{job}]["year"] nor config["{job}"] works (specifically, they give NameError and KeyError).
Is there a way to achieve this (ideally without a total rewrite)? A modification in the vein of what I've mentioned (or somehow running this workflow from a separate snakefile?) would be ideal, and I imagine that this is probably doable by just replacing all instances of config[job] with <something> and changing the input of 'rule all' to include a list of job numbers...
Thanks in advance!
If anyone else wants to know how I solved this: it required something of a rewrite and fairly extensive use of lambda functions. In addition, all files are now prefixed with their job number (I have a bash script outside Snakemake that deletes them all afterwards). I'm sure much of this is surplus to requirements, but it works well enough for me.
I specify a list of jobs in config: jobs: [j210, j215]. The j prefix is required because YAML parses a bare 210 as an integer key, whereas Snakemake wildcard values are always strings, so config[wildcards.job] would look up the string "210", miss the int key 210, and raise a KeyError (quoting the keys in the YAML would also work).
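A plain-Python sketch of that int-versus-string key mismatch (no Snakemake needed; values are illustrative):

```python
# YAML parses the unquoted key 210 as an int, so the config dict looks like:
config = {210: {"year": "2016"}}

# Snakemake wildcard values are always strings:
job = "210"
try:
    config[job]["year"]
except KeyError:
    print("KeyError: the str '210' does not match the int key 210")

# Prefixing with 'j' (or quoting the YAML keys) keeps both sides strings:
config = {"j210": {"year": "2016"}}
print(config["j210"]["year"])  # 2016
```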
I add an extra make_final rule that depends only on the jobs, and modify rule all accordingly (I also use lots of wildcard constraints; your need for them may vary). This makes job a wildcard, so config[job] can be accessed in either input: or params: via a lambda function, as config[wildcards.job]:
```python
rule all:
    input:
        expand("completed/{job}.log", job=config["jobs"])

rule make_final:
    # this input is just my final file from the chain of rules; awkward syntax,
    # as it needs a list comprehension - each source job produces 4 output files
    input:
        lambda wildcards: [config["outputdir"] + "/" + wildcards.job + "_"
                           + config[wildcards.job]["prefix"] + "_test_"
                           + config[wildcards.job]["year"]
                           + config[wildcards.job]["originID"]
                           + "_" + sample + ".root"
                           for sample in config[wildcards.job]["samples"]]
    output:
        "completed/{job}.log"
    shell:
        "touch {output}"
```
And earlier rules are modified, e.g. like this:
```python
rule filter_2_mc:
    input:
        # this new approach allows neater/more natural phrasing here, rather
        # than lots of config[job]["blah"] statements
        config["outputdir"] + "/filter-1-applied/{sj}/{job}_{prefix}_test_{year}{originID}_{sample}.root"
    output:
        config["outputdir"] + "/filter-2-applied/{sj}/{job}_{prefix}_test_{year}{originID}_{sample}.root"
    shell:
        "bash scripts/filter-2-new.sh {input} {output}"
```
Some rules needed lambda functions for their input: or params: wherever anything from config[wildcards.job] had to be specified.
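To illustrate that last pattern outside Snakemake, here is a hedged sketch in plain Python, with types.SimpleNamespace standing in for Snakemake's wildcards object (config values are illustrative):

```python
from types import SimpleNamespace

config = {"j210": {"year": "2016", "prefix": "Real"}}

# In the Snakefile this lambda would sit under a rule, e.g.:
# params:
#     year=lambda wildcards: config[wildcards.job]["year"]
year_param = lambda wildcards: config[wildcards.job]["year"]

# Snakemake calls the function with the resolved wildcards for the job:
wildcards = SimpleNamespace(job="j210")
print(year_param(wildcards))  # 2016
```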
(also apologies if answering my own question and marking it as the correct answer isn't allowed)