
Array of values as input in Snakemake workflows


I have started migrating my workflows from Nextflow to Snakemake and am already hitting a wall at the very start of my pipelines, which very often begin with a list of numbers (each representing a "run number" from our detector).

What I have for example is a run-list.txt like

# detector_id run_number
75 63433
75 67325
42 57584
42 57899
42 58998

which then needs to be passed line by line to a process that queries a database or data storage system and retrieves a file to the local system.

This means that e.g. 75 63433 would generate the output RUN_00000075_00063433.h5 via a rule which receives detector_id=75 and run_number=63433 as input parameters.
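For reference, the naming scheme described above can be written in plain Python roughly as follows. This is only a sketch: the helper name `run_filename` is hypothetical, and the 8-digit zero padding is inferred from the example file name.

```python
def run_filename(detector_id: int, run_number: int) -> str:
    """Build the target file name for one run (hypothetical helper)."""
    # Both numbers are zero-padded to 8 digits, as in RUN_00000075_00063433.h5.
    return f"RUN_{detector_id:08d}_{run_number:08d}.h5"


print(run_filename(75, 63433))  # RUN_00000075_00063433.h5
```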

With Nextflow this is fairly easy: I just define a process which emits a tuple of these values.

I don't quite understand how to do something like this in Snakemake, since it seems that inputs and outputs always need to be files (remote or local). In fact, some of the files are indeed accessible via iRODS and/or XRootD, but even then I need to start with a run selection, which is defined in a list like the run-list.txt above.

My question is now: what is the Snakemake-style approach to this problem?

Some non-working pseudocode would be:

rule:
    input:
        [line for line in open("run-list.txt").readlines()]
    output:
        "{detector_id}_{run_number}.h5"
    shell:
        "detector_id, run_number = line.split()"
        "touch "{detector_id}_{run_number}.h5""

Solution

  • To make this work you need two ingredients:

    1. a rule that specifies the logic for generating a single file (defining any file dependencies, if necessary)
    2. a rule that defines which files should be generated; by convention this rule is called all.

    Here is a rough sketch of the code:

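    def process_lines(file_name):
        """Yield zero-padded (detector_id, run_number) pairs, skipping non-numeric lines."""
        with open(file_name, "r") as f:
            for line in f:
                fields = line.split()
                # Skip blank or incomplete lines, which would otherwise
                # raise a ValueError when unpacking.
                if len(fields) < 2:
                    continue
                detector_id, run_number = fields[:2]
                # The "# detector_id run_number" header fails this check.
                if detector_id.isnumeric() and run_number.isnumeric():
                    # Pad to 8 digits, as in 00000075_00063433.
                    yield detector_id.zfill(8), run_number.zfill(8)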
    
    
    out_file_format = "{detector_id}_{run_number}.h5"
    final_files = [
        out_file_format.format(detector_id=detector_id, run_number=run_number)
        for detector_id, run_number in process_lines("run-list.txt")
    ]
    
    
    rule all:
        """Collect all outputs."""
        input:
            final_files,
    
    
    rule:
        """Generate an output"""
        output:
            out_file_format,
        shell:
            """
            echo {wildcards.detector_id}
            echo {wildcards.run_number}
            echo {output}
            # Placeholder: create the output file so that Snakemake finds it.
            touch {output}
            """
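
To see why listing the files in the all rule is enough: Snakemake takes each requested file name and matches it against the output patterns of the other rules, filling in the wildcards from the match. The plain-Python sketch below imitates that matching with a regular expression. It is not Snakemake's actual implementation, `match_wildcards` is a hypothetical helper, and restricting the wildcards to digits is an assumption that happens to fit the run list.

```python
import re

# Output pattern from the sketch above, with each wildcard written as a named group.
# Restricting the groups to digits is an assumption made for this illustration.
PATTERN = re.compile(r"(?P<detector_id>\d+)_(?P<run_number>\d+)\.h5")


def match_wildcards(target):
    """Return the wildcard values for a target file name, or None if it does not match."""
    m = PATTERN.fullmatch(target)
    return m.groupdict() if m else None


print(match_wildcards("00000075_00063433.h5"))
# {'detector_id': '00000075', 'run_number': '00063433'}
```

In a real Snakefile, a comparable restriction can be declared with the `wildcard_constraints` directive, which helps keep the two wildcards from matching ambiguously.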