I am encountering a problem that doesn't seem to occur consistently between folders.
Essentially, I thought I had a Snakemake pipeline that would work to copy files into folders (with different destinations for different subfolders). I am currently accomplishing this with some Python dictionaries as well as 2 wildcard values.
However, I am currently encountering a problem that I believe is due to a mismatch between the {outf} and {sample} wildcard values.
I believe that the wildcards are defined with rule all:
rule all:
    input:
        expand(os.path.join("{outf}","{sample}","methods.txt"), outf=OUTPREFIXES, sample=SAMPLES)
In the example that I will describe below:
{outf} and {sample} are correct for input
{outf} and {sample} are not correct in the log output for output
{outf} and {sample} are not correct in the log output for wildcards
I am removing some details related to the exact formatting, but the code is basically as follows:
import pandas as pd
import os
import re
data = pd.read_csv("mapping_list.csv").set_index('Subfolder', drop=False)
SAMPLES = data["Subfolder"].tolist()
OUTPREFIXES = data["Output"].tolist()
def get_input_folder(wildcards):
    return data.loc[wildcards.sample]["Input"]

def get_output_folder(wildcards):
    return data.loc[wildcards.sample]["Output"]

rule all:
    input:
        expand(os.path.join("{outf}","{sample}","methods.txt"), outf=OUTPREFIXES, sample=SAMPLES)

rule copy_folders:
    input:
        infolder = directory(get_input_folder),
        outfolder = directory(get_output_folder),
    output:
        os.path.join("{outf}","{sample}","methods.txt"),
    resources:
        mem_mb=2000,
        cpus=1
    shell:
        '''
        SHOUT1={input.outfolder}
        ...
        cp -R {input.infolder} $SHOUT1
        TEMPSAMPLE=$(basename {input.infolder})
        SHEND={input.outfolder}/$TEMPSAMPLE
        ...
        cp ../methods.txt $SHEND
        '''
I am receiving the following error message:
Waiting at most 5 seconds for missing files.
MissingOutputException in line 22 of /path/to/Snakefile:
Missing files after 5 seconds:
[Variable Destination Folder B]/[Sample A]/methods.txt
I believe that I can see the problem in an earlier part of the log:
rule copy_folders:
    input: /common/folder/path/[Sample A], [Variable Destination Folder A]
    output: [Variable Destination Folder B]/[Sample A]/methods.txt
    jobid: 171
    wildcards: outf=[Variable Destination Folder B], sample=[Sample A]
    resources: mem_mb=2000, cpus=1
I have a sample sheet where various folders are paired with a unique sample ID. On a given line, you would find [Sample A] and [Variable Destination Folder A]. On a different line, you would find [Sample B] and [Variable Destination Folder B], etc.
In other words, the mismatch for the wildcards at the earlier step matches the error message in that it describes a file that should not be created at that point (because the values for {outf} and {sample} are not matched correctly, for different lines "A" and "B").
The methods.txt file is not strictly needed. However, I encountered problems when trying to use a directory as the endpoint, so I copied an extra file and used that as the endpoint. If it helps, I can share the earlier code. However, for one other folder with fewer subfolders to copy and less complicated destination folders, something similar to the current code appeared to work successfully.
I had an earlier version of the code to try and make sure that the shell environment variables were "local" to each folder. I think the use of "local" caused a problem in itself, with an error message indicating that it can only be used within a function.
However, if I use the similarly simplified portion of the shell code, then the paths are filled in as follows:
local SHOUT1=[Variable Destination Folder A]
...
cp -R /common/folder/path/[Sample A] $SHOUT1
local TEMPSAMPLE=$(basename /common/folder/path/[Sample A])
local SHEND=[Variable Destination Folder A]/$TEMPSAMPLE
...
cp ../methods.txt $SHEND
In other words, it looks like the paths for the shell command were correct (all for line "A" in the sample mapping file). I assume this is because they only use input values, whereas the mismatch that I noticed affects the output and wildcards values. (Some troubleshooting was added to be able to handle a folder with a space in the name, where different parts of the same script need to use "\ " versus " " to run correctly, but I am excluding those folders to try and simplify the most immediate troubleshooting.) However, I can't run the Snakemake script if I can't specify the output value correctly.
Any assistance with troubleshooting would be greatly appreciated!
I thought this should be a relatively simple example to start learning Snakemake for what is basically cp -R $INPUTSUBFOLDER $OUTPUTFOLDER, but perhaps there are more complications than I realized.
Sincerely,
Charles
To me it looks like it pairs the input to your copy_folders rule correctly because you're using an input function that only uses your sample wildcard to get it. For the output, though, there's a mismatch because if you run the Snakefile without specifying another target, it wants all combinations of sample and outf that you specified in rule all.
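To make the mismatch concrete, here is a small plain-Python sketch of what a full cross product of the two lists looks like. It uses itertools.product instead of calling Snakemake's expand() itself, and the sample and folder names are hypothetical stand-ins for the rows of your sample sheet.

```python
import os
from itertools import product

# Hypothetical stand-ins for two rows of the sample sheet:
# sampleA belongs in destA, sampleB belongs in destB.
SAMPLES = ["sampleA", "sampleB"]
OUTPREFIXES = ["destA", "destB"]

# A full cross product requests every destination for every sample.
all_targets = [
    os.path.join(outf, sample, "methods.txt")
    for outf, sample in product(OUTPREFIXES, SAMPLES)
]

for target in all_targets:
    print(target)
```

With two rows this requests four targets, including destB/sampleA/methods.txt, which is the kind of path reported in your MissingOutputException.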
If you only want to pair [Sample A] with [Variable Destination Folder A] and so on, you'll need to change how Snakemake handles your expand() in rule all.
Right now, what you have is
rule all:
    input:
        expand(os.path.join("{outf}","{sample}","methods.txt"), outf=OUTPREFIXES, sample=SAMPLES)
This pairs all prefixes in OUTPREFIXES with all samples in SAMPLES, which is the standard behavior of expand(). You can specify a different combinatoric function in expand(), though. If you only want to combine the first sample with the first destination, the second with the second etc., your rule all should instead use zip, like so:
rule all:
    input:
        expand(os.path.join("{outf}","{sample}","methods.txt"), zip, outf=OUTPREFIXES, sample=SAMPLES)
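As a rough illustration of what the zip pairing produces, here is the same idea in plain Python, using the built-in zip rather than expand() itself. The names are hypothetical placeholders for your sample sheet entries.

```python
import os

# Hypothetical stand-ins for two rows of the sample sheet, in row order.
SAMPLES = ["sampleA", "sampleB"]
OUTPREFIXES = ["destA", "destB"]

# zip pairs the lists position by position: first with first, second with second.
paired_targets = [
    os.path.join(outf, sample, "methods.txt")
    for outf, sample in zip(OUTPREFIXES, SAMPLES)
]

for target in paired_targets:
    print(target)
```

This only gives the intended pairs if both lists are in the same row order, which should hold in your Snakefile because SAMPLES and OUTPREFIXES are both read from columns of the same data frame.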