
How to parallelize jobs for a list of files using snakemake (beginner question)


I am struggling with a very simple thing. As input to my snakemake pipeline I would like to take a directory, list its contents, and process each file from that directory in parallel. Naively I thought something like this should work:

rule all:
    input:
        "in/{test}.txt"
    output:
        "out/{test}.txt"
    shell:
        "echo {input} >> {output}"

This ends with the error:

WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.

All the resources I could find start with hard-coding the list of jobs in the script, which is something I want to avoid to keep the pipeline generic. The idea is to just point the pipeline to a directory with a list of files and let it do its job. Is this possible? Seems fairly simple and intuitive, but couldn't find an example showing that.


Solution

  • I don't know what command you used to run this rule, but the following workflow should serve your purpose:

    rule all:
        input:
            expand("out/{prefix}.txt", prefix=glob_wildcards("in/{test}.txt").test)
    
    rule test:
        input:
            "in/{test}.txt"
        output:
            "out/{test}.txt"
        shell:
            "echo {input} >> {output}"
    

    glob_wildcards is a snakemake function that finds all files matching the specified pattern (in/{test}.txt in this case); the .test attribute then gives the list of strings matched by {test} in those filenames (for example, "ab" in "in/ab.txt").
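As an illustration, here is a pure-Python stand-in (not snakemake's actual implementation) showing roughly what glob_wildcards does: each {wildcard} in the pattern becomes a regex group, and the matched values are collected from the files that fit the pattern.

```python
import re

# Hypothetical sketch of glob_wildcards("in/{test}.txt").test:
# match each filename against the pattern and collect the {test} values.
def mimic_glob_wildcards(files):
    pattern = re.compile(r"in/(?P<test>.+)\.txt")
    return [m.group("test") for f in files if (m := pattern.fullmatch(f))]

print(mimic_glob_wildcards(["in/ab.txt", "in/cd.txt", "notes.md"]))
# ['ab', 'cd']
```

In a real workflow you would of course call snakemake's own glob_wildcards, which also scans the filesystem for you.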

    Then expand substitutes each of those values into the placeholder wrapped in curly brackets, generating the list of input file names for rule all.
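For a single wildcard, expand behaves roughly like the following sketch (again a hypothetical stand-in, not snakemake's implementation), substituting each value into the placeholder with str.format:

```python
# Hypothetical sketch of expand("out/{prefix}.txt", prefix=[...]):
# substitute each wildcard value into the placeholder in the pattern.
def mimic_expand(pattern, **wildcards):
    name, values = next(iter(wildcards.items()))
    return [pattern.format(**{name: v}) for v in values]

print(mimic_expand("out/{prefix}.txt", prefix=["ab", "cd"]))
# ['out/ab.txt', 'out/cd.txt']
```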

    So rule all requests a list of files corresponding to all the txt files in the in folder, which makes snakemake execute rule test for every one of them. Run the workflow with, for example, snakemake --cores 4, and snakemake will execute independent jobs in parallel.