Snakemake using input files in different folders summarizing by name


I'm trying to develop a pipeline that takes input files from different directories, specified in a YAML config file, and keeps track of them by a name I specify in that file. For example, say my YAML looks like this:

input:
    name1: /some/path/to/file1
    name2: /a/totally/different/path/to/file2
    name3: /yet/another/path/to/file3
output: /path/to/outdir

I'd like to go through a series of steps and end up with an outdir that contains:

/path/to/outdir/processed_name1.extension
/path/to/outdir/processed_name2.extension
/path/to/outdir/processed_name3.extension
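To make the intended mapping concrete, here is a minimal sketch in plain Python (not Snakemake code). The `config` dict simply mirrors the YAML above, and `expected_outputs` is a hypothetical helper name introduced only for illustration.

```python
# Plain-Python sketch of the desired name -> output mapping.
# The dict mirrors the YAML above; all paths are the question's placeholders.
config = {
    "input": {
        "name1": "/some/path/to/file1",
        "name2": "/a/totally/different/path/to/file2",
        "name3": "/yet/another/path/to/file3",
    },
    "output": "/path/to/outdir",
}


def expected_outputs(cfg):
    """Return {name: final output path} for every name listed under 'input'."""
    return {
        name: f"{cfg['output']}/processed_{name}.extension"
        for name in cfg["input"]  # iterating a dict yields its keys
    }


for name, path in expected_outputs(config).items():
    print(name, "->", path)
```

The input paths play no part in the output names; only the keys under `input` do, which is what lets files from unrelated folders end up side by side in one outdir.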

I honestly can't get anything to work. Where I've stalled is trying to treat the names as wildcards and using them to index into the config dictionary. This doesn't seem to work: the wildcards are never initialized, since the very first step is accessing the inputs. I can't be very specific with my code example due to company policy, but it basically looks like this:

rule all:
    input:
        processed_files = expand(config['output'] + "/processed_{name}.extension", name=config['input'])
        

rule step_1:
    input:
        input_file = lambda wc: config['input'][wc.name]
    output:
        intermediate_file = config['output'] + "/intermediate_{name}.extension"
    run:
        <some command>

rule step_2:
    input:
        intermediate_file = config['output'] + "/intermediate_{name}.extension"
    output:
        processed_file = config['output'] + "/processed_{name}.extension"
    run:
        <some command>

But this gives me wildcard errors, which I think makes sense: there's no way for it to figure out the wildcards, since they only exist in the config file. This feels very similar to the Advanced Workflow Example, yet different enough that I just can't get it to work...
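For context on how the wildcards get their values: Snakemake works backwards from the concrete targets requested in `rule all`, matching each target against the rules' output patterns, so `{name}` is filled in before any input function runs. The sketch below is a simplified plain-Python model of that idea, not Snakemake's actual implementation; `pattern_to_regex` and `infer_wildcards` are hypothetical helper names, and the config values are the question's placeholders.

```python
import re

config = {
    "input": {
        "name1": "/some/path/to/file1",
        "name2": "/a/totally/different/path/to/file2",
        "name3": "/yet/another/path/to/file3",
    },
    "output": "/path/to/outdir",
}


def pattern_to_regex(pattern):
    """Turn 'dir/processed_{name}.ext' into a regex with one named group per wildcard."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        # Odd positions hold wildcard names, even positions hold literal text.
        regex += f"(?P<{part}>.+)" if i % 2 else re.escape(part)
    return re.compile(regex)


def infer_wildcards(target, output_pattern):
    """Return the wildcard values that make output_pattern equal target, or None."""
    match = pattern_to_regex(output_pattern).fullmatch(target)
    return match.groupdict() if match else None


step_1_output = config["output"] + "/intermediate_{name}.extension"
step_2_output = config["output"] + "/processed_{name}.extension"

# A concrete target, as rule all would request it.
target = "/path/to/outdir/processed_name2.extension"
wc = infer_wildcards(target, step_2_output)          # {'name': 'name2'}

# step_2 needs the intermediate file, which in turn matches step_1's output.
intermediate = step_1_output.format(**wc)
wc_step_1 = infer_wildcards(intermediate, step_1_output)

# Only now is the config lookup performed, with the wildcard already known.
source = config["input"][wc_step_1["name"]]
print(wc, intermediate, source)
```

In this model the config lookup in `step_1` happens last, once `name` is already known, so the wildcard does not need to exist anywhere except in the target list.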

EDIT 1: I replaced all the f-strings with string concatenation, just to make sure that's not the issue.

EDIT 2: I eventually got it to work. I'm honestly not sure what changed; I must have had a typo or something. All I can really say is that this overall structure worked.


Solution

  • I found no major errors in the code you showed, though I removed the f-strings and changed run: to shell: to make testing easy. The following works fine with an appropriate config file.

    configfile: "config.yaml"

    # Request one final file per name; expand() iterates over the keys of config['input'].
    rule all:
        input:
            processed_files = expand(config['output'] + "/processed_{name}.extension", name=config['input'])


    rule step_1:
        input:
            # Input function: called with the wildcards matched from the requested output,
            # so wc.name is already known when the config lookup happens.
            input_file = lambda wc: config['input'][wc.name]
        output:
            intermediate_file = config['output'] + "/intermediate_{name}.extension"
        shell:
            "cat {input} > {output}"

    rule step_2:
        input:
            intermediate_file = config['output'] + "/intermediate_{name}.extension"
        output:
            processed_file = config['output'] + "/processed_{name}.extension"
        shell:
            "cat {input} > {output}"
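
    For anyone who wants to reproduce the test, a throwaway fixture can be generated with a few lines of Python. This is only a sketch under assumptions: the folder names, file contents, and the helper name make_fixture are invented for illustration, and the paths are assumed to be POSIX-style so they can be written to the YAML unquoted.

```python
import tempfile
from pathlib import Path


def make_fixture(root):
    """Create dummy inputs in separate folders plus a matching config.yaml under root."""
    root = Path(root)
    # Hypothetical layout: each named input lives in a different directory.
    inputs = {
        "name1": root / "a" / "file1",
        "name2": root / "b" / "nested" / "file2",
        "name3": root / "c" / "file3",
    }
    for name, path in inputs.items():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"contents for {name}\n")

    # Write the same structure as the question's YAML, by hand (no YAML library needed).
    lines = ["input:"]
    lines += [f"    {name}: {path}" for name, path in inputs.items()]
    lines.append(f"output: {root / 'outdir'}")

    config_path = root / "config.yaml"
    config_path.write_text("\n".join(lines) + "\n")
    return config_path


if __name__ == "__main__":
    fixture_root = tempfile.mkdtemp(prefix="smk_test_")
    print(make_fixture(fixture_root))
```

    With the Snakefile above saved next to the generated config.yaml, running snakemake -n from that directory should list the planned jobs as a dry run, and snakemake --cores 1 should then execute them.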