I am still very confused about the wildcards concept despite reading the full docs and a few examples, so maybe someone can shed light on this weird behaviour. It might be a bug but it's such a basic example that I am pretty sure I am doing or understanding something wrong.
Here is my Snakefile, which should generate a bunch of files defined in a dictionary where the location of the files is stored (those can be served by all kinds of data providers like iRODS, XRootD etc., but it's not important now):
import os

some_files = {
    "foo": "some_location/foo",
    "bar": "another_location/bar",
    "baz": "yet_another_loc/baz"
}

rule all:
    input: ["raw/" + os.path.basename(f) for f in some_files.keys()]

rule generate_files:
    output:
        temp("raw/{fname}")
    shell:
        "echo grabbed file from {some_files[wildcards.fname]} > {output}"
As you can see, I need to use a "trick" similar to the one proposed in my previous question (Array of values as input in Snakemake workflows) to force the recognition of the files by adding a rule and listing those (in rule all), which works nicely.
The rule generate_files should then generate (retrieve) those by using the corresponding URL and protocol defined in some_files. For the sake of simplicity, it's now just echoing the origin into the output file.
To achieve this, I thought I could simply use wildcards.fname in the shell section, but when I run the workflow, I get:
░ tamasgal@silentbox-(2):PhD/snakemake master ●●● snakemake took 16s
░ 08:47:35 > snakemake -c1
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job               count    min threads    max threads
--------------  -------  -------------  -------------
all                   1              1              1
generate_files        3              1              1
total                 4              1              1
Select jobs to execute...
[Fri Feb 18 08:47:38 2022]
rule generate_files:
    output: raw/bar
    jobid: 2
    wildcards: fname=bar
    resources: tmpdir=/var/folders/84/mcvklq757tq1nfrkbxvvbq8m0000gn/T
RuleException in line 12 of /Users/tamasgal/Dev/PhD/snakemake/Snakefile:
NameError: The name 'wildcards.fname' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}
If I use fname (and not wildcards.fname), Snakemake proposes to use wildcards.fname, which, again, does not work. Here is the output when running with fname in the shell section:
[Fri Feb 18 08:47:48 2022]
rule generate_files:
    output: raw/bar
    jobid: 2
    wildcards: fname=bar
    resources: tmpdir=/var/folders/84/mcvklq757tq1nfrkbxvvbq8m0000gn/T
RuleException in line 12 of /Users/tamasgal/Dev/PhD/snakemake/Snakefile:
NameError: The name 'fname' is unknown in this context. Did you mean 'wildcards.fname'?
Why is this happening? The output of the workflow clearly shows that wildcards: fname=bar, so it exists and is defined. Is this a bug?
Hm, you may have to try and get at some_files[wildcards.fname] outside of the shell part? It looks to me like it can tell what the wildcard is supposed to be for the output to be raw/bar, but it can't handle using it to access the dict in the shell part. It seems to me like this could be handled with an input function.
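As far as I understand it, the shell string is filled in with Python's str.format-style field syntax, which would explain the error message. Here is a minimal plain-Python sketch of that behaviour, not Snakemake itself: SimpleNamespace is only a stand-in for the wildcards object, and the {source} placeholder is a made-up name for illustration.

```python
from types import SimpleNamespace

# Stand-ins for the Snakefile's dict and for the wildcards object
# (assumption: SimpleNamespace merely mimics attribute access).
some_files = {"foo": "some_location/foo", "bar": "another_location/bar"}
wildcards = SimpleNamespace(fname="bar")

template = "echo grabbed file from {some_files[wildcards.fname]} > {output}"

try:
    template.format(some_files=some_files, wildcards=wildcards, output="raw/bar")
    failed_key = None
except KeyError as err:
    # The text inside [...] is taken verbatim as a string key;
    # it is never evaluated as a Python expression.
    failed_key = err.args[0]

print(failed_key)

# Doing the dictionary lookup in Python first and passing the result
# into the template as a simple field avoids the problem.
resolved = some_files[wildcards.fname]
command = "echo grabbed file from {source} > {output}".format(
    source=resolved, output="raw/bar"
)
print(command)
```

In the format field syntax, the part inside the square brackets is a literal key, so the dictionary is asked for an entry named "wildcards.fname" rather than for the value of the wildcard. That is why the lookup has to happen in Python code, outside the shell string.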
Off the top of my head:
rule generate_files:
    input:
        some_file = lambda wildcards: some_files[wildcards.fname]
    output:
        temp("raw/{fname}")
    shell:
        "echo grabbed file from {input.some_file} > {output}"
EDIT: if it fails because the file isn't local and Snakemake therefore can't find it, you can supply the path as a parameter instead:
rule generate_files:
    params:
        some_file = lambda wildcards: some_files[wildcards.fname]
    output:
        temp("raw/{fname}")
    shell:
        "echo grabbed file from {params.some_file} > {output}"