
Snakemake | Creating an aggregate without specifying a list in expand


My directory structure looks like this:

-- path
   -- parameter_combination_1
      - time_average.property1.csv
      - time_average.property2.csv
      - ...
   -- parameter_combination_2
      - time_average.property1.csv
      - time_average.property2.csv
      - ...
   -- ...

I would like to create a rule that aggregates the information of all files carrying the time_average name, for given values of the wildcards {filename} (e.g. property1.csv) and {path}.

Hence, the input files for the example wildcard would be:

  • path/parameter_combination_1/time_average.property1.csv
  • path/parameter_combination_2/time_average.property1.csv
  • path/parameter_combination_3/time_average.property1.csv
  • ...

I know that with expand I can cover parameter combinations. This requires me to specify the parameters to be covered; e.g., I could write a rule with a fixed list of parameter combinations as follows (similar to this section of the Snakemake tutorial):

rule aggregate_time_averages:
    input:
        expand("{path}/{parameter_combination}/time_average.{filename}", parameter_combination=['parameter_combination_1', 'parameter_combination_2'])
    output:
        "{path}/aggregate.{filename}"

Is there a way to glob / to collect all parameter_combination_* folders without having to specify a fixed list? What would be the best practice in this case?

I also read about the glob_wildcards function.

I would expect something like this to work:


PROJECT_PATHS, PARAMS_COMBS, SEEDS = glob_wildcards("{path}/{parameter_combination}/{seed}/config.yaml")

rule aggregate_time_averages:
    input:
        expand("{path}/{parameter_combination}/time_average.{filename}", parameter_combination=PARAMS_COMBS)
    output:
        "{path}/aggregate.{filename}"

With the command:

snakemake --cores 1 test_project/aggregate.order_parameter.csv --use-conda

I then get the error No values given for wildcard 'path'. (which then cannot be processed by expand anymore, I guess, so maybe I should not be using expand here at all?).

Also, glob_wildcards in the global Snakemake scope gives me ALL wildcards; what I want, however, is just the values of {parameter_combination} that match the {path} / {filename} combination for which the rule is called (so I would expect the globbing to take place in the rule itself).

Thank you for your help :)


Solution

  • Sure, you can use an input function for your rule, which evaluates glob_wildcards based on the wildcard values given to the rule:

    def input_timefiles(wildcards):
        param_combs = glob_wildcards(f"{wildcards.path}/{{param_comb}}/time_average.{wildcards.filename}").param_comb
    
        return expand("{path}/{param_comb}/time_average.{filename}", path=wildcards.path, param_comb=param_combs, filename=wildcards.filename)
    
    rule aggregate_time_averages:
        input:
            input_timefiles
        output:
            "{path}/aggregate.{filename}"
    

    Note that due to the default behaviour of expand(..), this will produce the combinatorial product of all {path}, {param_comb}, and {filename} values, some of which don't necessarily exist. If not all combinations exist, another solution could be to use pathlib.Path.rglob(..) instead to determine the input files.
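    Such a pathlib-based input function might look roughly like this (a minimal sketch; the helper name and the one-level glob pattern are assumptions, not part of the original answer):

    ```python
    from pathlib import Path

    def input_timefiles_glob(wildcards):
        """Collect only the time_average files that actually exist on disk,
        instead of expanding the full combinatorial product."""
        base = Path(wildcards.path)
        # One directory level deep: <path>/<param_comb>/time_average.<filename>
        return sorted(str(p) for p in base.glob(f"*/time_average.{wildcards.filename}"))
    ```

    Since the function only returns files that are actually present, there is no risk of requesting nonexistent combinations; the trade-off is that missing files are silently skipped rather than reported as errors.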

    If the files are created by an earlier rule and don't exist before the workflow is executed, you might want to look into checkpoint rules. See this SO answer for details.
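    A checkpoint-based variant might look roughly like this (a hedged sketch; the checkpoint name `simulate` and the directory output are assumptions about how the upstream rule is structured):

    ```
    # Declaring the upstream rule as a checkpoint makes Snakemake
    # re-evaluate the DAG after it has run, so files it created can
    # be globbed by downstream input functions.
    checkpoint simulate:
        output:
            directory("{path}")
        shell:
            "..."  # produces the parameter_combination_* folders

    def input_timefiles(wildcards):
        # Calling .get(..) blocks until the checkpoint has finished
        # for these wildcard values, then its output can be globbed.
        ckpt_dir = checkpoints.simulate.get(path=wildcards.path).output[0]
        param_combs = glob_wildcards(f"{ckpt_dir}/{{param_comb}}/time_average.{wildcards.filename}").param_comb
        return expand(f"{ckpt_dir}/{{param_comb}}/time_average.{wildcards.filename}", param_comb=param_combs)
    ```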