My directory structure looks like this:
```
-- path
   -- parameter_combination_1
      - time_average.property1.csv
      - time_average.property2.csv
      - ...
   -- parameter_combination_2
      - time_average.property1.csv
      - time_average.property2.csv
      - ...
   -- ...
```
I would like to create a rule that aggregates the information of all files carrying the `time_average` name, for given values of the wildcards `{filename}` (e.g. `property1.csv`) and `{path}`.
Hence, the input files for the example wildcard would be:

```
path/parameter_combination_1/time_average.property1.csv
path/parameter_combination_2/time_average.property1.csv
path/parameter_combination_3/time_average.property1.csv
```
I know that with `expand` I can cover parameter combinations. However, this requires me to specify the parameters to be covered; e.g. I could write a rule with a fixed list of `parameter_combinations` as follows (similar to this section in the Snakemake tutorial):
```python
rule aggregate_time_averages:
    input:
        expand("{path}/{parameter_combination}/time_average.{filename}", parameter_combination=['parameter_combination_1', 'parameter_combination_2'])
    output:
        "{path}/aggregate.{filename}"
```
Is there a way to glob / collect all `parameter_combination_*` folders without having to specify a fixed list? What would be the best practice in this case?
I also read about the `glob_wildcards` function here. I would expect something like this to work:
```python
PROJECT_PATHS, PARAMS_COMBS, SEEDS = glob_wildcards("{path}/{parameter_combination}/{seed}/config.yaml")

rule aggregate_time_averages:
    input:
        expand("{path}/{parameter_combination}/time_avg.{filename}", parameter_combination=PARAMS_COMBS)
    output:
        "{path}/aggregate.{filename}"
```
With the command:

```
snakemake --cores 1 test_project/aggregate.order_parameter.csv --use-conda
```
I then get the error `No values given for wildcard 'path'` (which then cannot be processed by `expand` any more, I guess, so maybe I should not be using `expand` here at all?).
Also, `glob_wildcards` in the global Snakemake scope gives me ALL wildcards; what I want, however, is just the `{parameter_combination}` values that match the `{path}`/`{filename}` combination for which the rule is called (so I would expect the globbing to take place in the rule itself).
Thank you for your help :)
Sure, you can use an input function for your rule which evaluates `glob_wildcards` based on the wildcard values given to the rule:
```python
def input_timefiles(wildcards):
    # Glob only within the path/filename combination the rule was called for.
    param_combs = glob_wildcards(f"{wildcards.path}/{{param_comb}}/time_average.{wildcards.filename}").param_comb
    return expand("{path}/{param_comb}/time_average.{filename}", path=wildcards.path, param_comb=param_combs, filename=wildcards.filename)

rule aggregate_time_averages:
    input:
        input_timefiles
    output:
        "{path}/aggregate.{filename}"
```
Note that due to the default behaviour of `expand(..)`, this will produce the combination product of all `{path}`, `{param_comb}`, `{filename}` values, which don't necessarily all exist. If not all combinations exist, another solution could be to use `pathlib.Path.rglob(..)` instead to determine the input files.
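As a rough sketch of that alternative (the function name `input_timefiles_globbed` is hypothetical, and the directory layout from the question is assumed; since the files sit exactly one level below `{path}`, a one-level `Path.glob(..)` pattern suffices here, while `rglob(..)` would also match deeper nesting):

```python
from pathlib import Path

def input_timefiles_globbed(wildcards):
    # Return only files that actually exist on disk, avoiding the
    # combination product that expand(..) would build.
    pattern = f"*/time_average.{wildcards.filename}"
    return sorted(str(p) for p in Path(wildcards.path).glob(pattern))
```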
If the files are created by an earlier rule and don't exist before the workflow is executed, you might want to look into `checkpoint` rules. See this SO answer for details.
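For completeness, a minimal sketch of the checkpoint approach. All names here are hypothetical; it assumes a single rule `make_results` produces every parameter-combination directory below a `{path}/results` directory, which differs slightly from the layout in the question:

```python
# Hypothetical sketch of a checkpoint-based workflow.
checkpoint make_results:
    output:
        directory("{path}/results")
    shell:
        "..."  # creates {path}/results/<param_comb>/time_average.<filename> files

def input_timefiles(wildcards):
    # checkpoints.<name>.get(..) defers DAG evaluation until the checkpoint
    # has finished, so the glob below sees the files it produced.
    results_dir = checkpoints.make_results.get(path=wildcards.path).output[0]
    param_combs = glob_wildcards(
        f"{results_dir}/{{param_comb}}/time_average.{wildcards.filename}"
    ).param_comb
    return expand(
        "{path}/results/{param_comb}/time_average.{filename}",
        path=wildcards.path, param_comb=param_combs, filename=wildcards.filename,
    )

rule aggregate_time_averages:
    input:
        input_timefiles
    output:
        "{path}/aggregate.{filename}"
```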