I am running into a WildcardError: Wildcards in input files cannot be determined from output files
problem with Snakemake. The issue is that I don't want to keep a variable part of my input file name. For instance, suppose I have these files.
$ mkdir input
$ touch input/a-foo.txt
$ touch input/b-wsdfg.txt
$ touch input/c-3523.txt
And I have a Snakemake file like this:
subjects = ['a', 'b', 'c']
result_pattern = "output/{kind}.txt"
rule all:
input:
expand(result_pattern, kind=subjects)
rule step1:
input:
"input/{kind}-{fluff}.txt"
output:
"output/{kind}.txt"
shell:
"""
cp {input} {output}
"""
I want the output file names to just have the part I'm interested in. I understand the principle that every wildcard in input needs a corresponding wildcard in output. So is what I'm trying to do a sort of anti-pattern? For instance, I suppose there could be two files input/a-foo.txt
and input/a-bar.txt
, and they would overwrite each other. Should I be renaming my input files prior to feeding into snakemake?
I want the output file names to just have the part I'm interested in [...]. I suppose there could be two files input/a-foo.txt and input/a-bar.txt, and they would overwrite each other.
It seems to me you need to decide how to resolve such conflicts. If the input files are:
input/a-bar.txt
input/a-foo.txt <- Note duplicate {a}
input/b-wsdfg.txt
input/c-3523.txt
How do you want the output files to be named and according to what criteria? The answer is independent of snakemake
but depending on your circumstances you could include python code within the Snakefile to do handle such conflicts automatically.
Basically, once you make such decisions you can work on the solution.
But suppose there are no file name conflicts, it seems like the wildcard system doesn't handle cases where you want to remove some variable fluff from a filename
The variable part can be handled using python's glob patterns:
import glob
...
rule step1:
input:
glob.glob("input/{kind}-*.txt")
output:
"output/{kind}.txt"
shell:
"""
cp {input} {output}
"""
You could even be more elaborate and use a dedicated function to match files given the {kind}
wildcard:
def get_kind_files(wc):
ff = glob.glob("input/%s-*.txt" % wc.kind)
if len(ff) != 1:
raise Exception('Exepected exactly 1 file for kind "%s"' % wc.kind)
# Possibly more checks tha you got the right file
return ff
rule step1:
input:
get_kind_files,
output:
"output/{kind}.txt"
shell:
"""
cp {input} {output}
"""