Tags: python, snakemake

Modify snakefile to run multiple iterations of one workflow


I have a Snakemake workflow with a single Snakefile and a single config file. In my Snakefile, I specify a job; jobs are numbered non-sequentially (e.g. 210, 215). For each job I can specify, the config file has a corresponding entry with the information about that particular job (parameters like year, number of subjobs, a prefix for files, etc., all stored as strings). In rules, I construct input and output paths with statements like config[job]["year"] to provide the correct strings for each job.

A simplified example of my workflow to hopefully demonstrate how I use the information from the config file:

# SNAKEFILE
job = 210

rule all:
    input:
        expand(config["outputdir"]+"/"+config[job]["prefix"]+"_test_"+config[job]["year"]+config[job]["originID"]+"_{sample}.root", sample=config[job]["samples"])

# ...other rules...

rule filter_2:
    input:
        config["outputdir"]+"/filter-1-applied/{sj}/"+config[job]["prefix"]+"_test_"+config[job]["year"]+config[job]["originID"]+"_{sample}.root"
    output:
        config["outputdir"]+"/filter-2-applied/{sj}/"+config[job]["prefix"]+"_test_"+config[job]["year"]+config[job]["originID"]+"_{sample}.root"
    log:
        # log file for the stderr redirect in the shell command
        config["outputdir"]+"/logs/filter-2/{sj}_{sample}.log"
    shell:
        "(bash scripts/filter-2.sh {input} {output}) 2> {log}"

# ...other rules...

# CONFIG.YAML
outputdir: "/home/ghl/outputs"
210:
    prefix: "Real"
    year: "2016"
    origindir: "/home/ghl/files/210"
    subjobs: 2653
    originID: "_abc123"
    samples: ["type1_v1","type1_v2","type2_v1","type2_v2"]

This was fine when I had a small number of jobs, but now that I have ~80 to run, some taking several hours even on the batch submission system I have access to, it takes forever to run each one manually, wait, change the 'job' variable, and run again. What I would like is to be able to run multiple jobs (e.g. 210 and 215) from a single run of this Snakefile.

In python I would just enclose this all in a loop like:

for job in [1,3,...,210,215]:
    <run single job workflow>
print("Done!")

I'm trying to do the same in my Snakefile. I've tried defining jobs=[210,215] and putting job=jobs in the input of 'rule all' (as I do for samples), and also changing the input to a function that returns the corresponding filenames from a list of jobs. Both run into the same issue: 'job' is no longer a Python variable in the script but a wildcard, and it's unclear to me how I should pass a wildcard to something like config[job]["year"].
Neither config[{job}]["year"] nor config["{job}"]["year"] works (they give NameError and KeyError respectively).

Is there a way to achieve this (ideally without a total rewrite)? A modification in the vein of what I've mentioned (or somehow running this workflow from a separate snakefile?) would be ideal, and I imagine that this is probably doable by just replacing all instances of config[job] with <something> and changing the input of 'rule all' to include a list of job numbers...

Thanks in advance!


Solution

  • In case anyone else wants to know how I solved this: it required something of a rewrite and fairly extensive use of lambda functions, and additionally all files are now prefixed with their job number (I have a bash script that runs outside of Snakemake to delete them all). I'm sure much of this is surplus to requirements, but it works well enough for me.

    I specify a list of jobs in the config: jobs: [j210, j215]. (The j prefix is required, as Snakemake gets a KeyError if it interprets them as ints instead of strings, for reasons I don't quite understand; see the sketch below.)
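
    A likely reason (my explanation, not from the original post): YAML parses an unquoted key such as 210: as the integer 210, while Snakemake wildcard values are always strings, so config[wildcards.job] looks up the string "210" and misses the integer key. A tiny plain-Python illustration of the mismatch:

    # sketch only - why integer config keys clash with string-valued wildcards
    config = {210: {"year": "2016"}}   # how YAML loads an unquoted numeric key
    job = "210"                        # wildcard values are always strings
    print(config.get(job))             # -> None; config[job] would raise KeyError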

    I add an extra make_final rule that depends only on jobs, and modify rule all (I also use lots of wildcard constraints; your need for them may vary, and a sketch of them follows the rules below). This turns job into a wildcard, so config[job] can be accessed within either input or params with a lambda function, as config[wildcards.job]:

    rule all:
        input:
            expand("completed/{job}.log", job=config["jobs"])

    rule make_final:
        # this input is just the final file from the chain of rules. The syntax is
        # awkward because it needs a list comprehension - each source job produces
        # 4 output files (one per sample)
        input:
            lambda wildcards: [
                config["outputdir"] + "/{job}_"
                + config[wildcards.job]["prefix"] + "_test_"
                + config[wildcards.job]["year"]
                + config[wildcards.job]["originID"]
                + "_" + sample + ".root"
                for sample in config[wildcards.job]["samples"]
            ]
        output:
            "completed/{job}.log"
        shell:
            "touch {output}"
    

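    The wildcard constraints mentioned above might look roughly like this at the top of the Snakefile (a sketch - the exact regexes are my assumptions, not taken from the original workflow). They stop adjacent wildcards such as {year}{originID} from swallowing each other's text:

    wildcard_constraints:
        job="|".join(config["jobs"]),    # only jobs listed in config, e.g. j210|j215
        prefix="[^_/]+",                 # no underscores or slashes
        year=r"\d\d\d\d",                # a four-digit year, e.g. 2016
        originID="_[A-Za-z0-9]+"         # e.g. _abc123
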
    And earlier rules are modified, e.g. like this:

    rule filter_2_mc:
        input:
            # this new approach allows neater/more natural phrasing here, rather than
            # using lots of config[job]["blah"] statements
            config["outputdir"]+"/filter-1-applied/{sj}/{job}_{prefix}_test_{year}{originID}_{sample}.root"
        output:
            config["outputdir"]+"/filter-2-applied/{sj}/{job}_{prefix}_test_{year}{originID}_{sample}.root"                                                                                                                       
        shell:
            "bash scripts/filter-2-new.sh {input} {output}"
    

    Some rules need lambda functions for their input: or params: whenever anything from config[wildcards.job] has to be specified; see the sketch below.
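
    For example, a rule that needs per-job values from the config can pass them through params (a hypothetical rule - the name and shell command are made up, but the lambdas show the pattern):

    rule report_job:
        output:
            "completed/{job}_report.txt"
        params:
            # look up per-job settings from the config via the job wildcard
            nsubjobs=lambda wildcards: config[wildcards.job]["subjobs"],
            origindir=lambda wildcards: config[wildcards.job]["origindir"]
        shell:
            "echo 'job {wildcards.job}: {params.nsubjobs} subjobs from {params.origindir}' > {output}"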

    (also apologies if answering my own question and marking it as the correct answer isn't allowed)