Snakemake: Mismatched Wildcards Variable Values for "output" Rule


I am encountering a problem that doesn't seem to occur consistently between folders.

Essentially, I thought I had a Snakemake pipeline that would work to copy files into folders (with different destinations for different subfolders). I am currently accomplishing this with some Python dictionaries as well as 2 wildcard values.

However, I am currently encountering a problem that I believe is due to a mismatch between the {outf} and {sample} wildcard values.

Brief Description

I believe that the wildcards are defined with rule all:

rule all:
    input:
        expand(os.path.join("{outf}","{sample}","methods.txt"), outf=OUTPREFIXES, sample=SAMPLES)

In the example that I will describe below:

  • Pairing of {outf} and {sample} is correct for input
  • Pairing of {outf} and {sample} is not correct in the log output for output
  • Pairing of {outf} and {sample} is not correct in the log output for wildcards

Additional Details

I am removing some details related to the exact formatting, but the code is basically as follows:

import pandas as pd
import os
import re

data = pd.read_csv("mapping_list.csv").set_index('Subfolder', drop=False)
SAMPLES = data["Subfolder"].tolist()
OUTPREFIXES = data["Output"].tolist()

def get_input_folder(wildcards):
    return data.loc[wildcards.sample]["Input"]

def get_output_folder(wildcards):
    return data.loc[wildcards.sample]["Output"]
    
rule all:
    input:
        expand(os.path.join("{outf}","{sample}","methods.txt"), outf=OUTPREFIXES, sample=SAMPLES)

rule copy_folders:
    input:
        infolder = directory(get_input_folder),
        outfolder = directory(get_output_folder),
    output:
        os.path.join("{outf}","{sample}","methods.txt"),
    resources:
        mem_mb=2000,
        cpus=1
    shell:
        '''
        SHOUT1={input.outfolder}
        ...
        cp -R {input.infolder} $SHOUT1
        
        TEMPSAMPLE=$(basename {input.infolder})
        SHEND={input.outfolder}/$TEMPSAMPLE
        ...
        cp ../methods.txt $SHEND
        '''

I am receiving the following error message:

Waiting at most 5 seconds for missing files.
MissingOutputException in line 22 of /path/to/Snakefile:
Missing files after 5 seconds:
[Variable Destination Folder B]/[Sample A]/methods.txt

I believe that I can see the problem in an earlier part of the log:

rule copy_folders:
    input: /common/folder/path/[Sample A], [Variable Destination Folder A]
    output: [Variable Destination Folder B]/[Sample A]/methods.txt
    jobid: 171
    wildcards: outf=[Variable Destination Folder B], sample=[Sample A]
    resources: mem_mb=2000, cpus=1

I have a sample sheet where various folders are paired with a unique sample ID. On a given line, you would find [Sample A] and [Variable Destination Folder A]. On a different line, you would find [Sample B] and [Variable Destination Folder B], etc.

In other words, the mismatch for the wildcards at the earlier step matches the error message in that it describes a file that should not be created at that point (because the values for {outf} and {sample} are not matched correctly, for different lines "A" and "B").

The methods.txt file is not strictly needed. However, I encountered problems when trying to use a directory as the endpoint, so I copied an extra file and used that as the endpoint. If it helps, I can share the earlier code. However, for one different folder with a smaller number of subfolders to copy and less complicated destination folders, something similar to the current code appeared to work successfully.

I had an earlier version of the code that tried to make sure the shell environment variables were "local" to each folder. I think the use of "local" caused a problem in itself, with an error message indicating that it can only be used within a function.

However, if I use a similarly simplified portion of that shell code, the paths are filled in as follows:

        local SHOUT1=[Variable Destination Folder A]
        ...
        cp -R /common/folder/path/[Sample A] $SHOUT1
        
        local TEMPSAMPLE=$(basename /common/folder/path/[Sample A])
        local SHEND=[Variable Destination Folder A]/$TEMPSAMPLE
        ...
        cp ../methods.txt $SHEND

In other words, it looks like the paths for the shell command were correct (all for line "A" in the sample mapping file). I assume this is because they only use values derived from the input, whereas the mismatch I noticed affects the output and wildcards values. (Some troubleshooting was added to handle a folder with a space in the name, where different parts of the same script need to use "\ " versus " " to run correctly, but I am excluding those folders to try and simplify the most immediate troubleshooting.) However, I can't run the Snakemake script if I can't specify the output value correctly.

Any assistance with troubleshooting would be greatly appreciated!

I thought this should be a relatively simple example to start learning Snakemake for what is basically cp -R $INPUTSUBFOLDER $OUTPUTFOLDER, but perhaps there are more complications than I realized.

Sincerely,

Charles


Solution

  • To me it looks like it pairs the input to your copy_folders rule correctly because you're using an input function that only uses your sample wildcard to get it. For the output, though, there's a mismatch because if you run the Snakefile without specifying another target, it wants all combinations of sample and outf that you specified in rule all.

    If you only want to pair [Sample A] with [Variable Destination Folder A] and so on, you'll need to change how Snakemake handles your expand() in rule all.

    Right now, what you have is

    rule all:
        input:
            expand(os.path.join("{outf}","{sample}","methods.txt"), outf=OUTPREFIXES, sample=SAMPLES)
    

    This pairs all prefixes in OUTPREFIXES with all samples in SAMPLES, which is the standard behavior of expand(). You can specify a different combinatoric function in expand(), though. If you only want to combine the first sample with the first destination, the second with the second, etc., your rule all should instead use zip, like so:

    rule all:
        input:
            expand(os.path.join("{outf}","{sample}","methods.txt"), zip, outf=OUTPREFIXES, sample=SAMPLES)
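
To see why the default requests targets that no sample-sheet line describes, here is a minimal sketch in plain Python that mimics the two pairing strategies with itertools.product and the built-in zip. It does not call Snakemake itself; the folder and sample names are hypothetical stand-ins for the two columns of the sample sheet, and the target() helper is only for illustration.

```python
from itertools import product
import os

# Hypothetical stand-ins for the "Subfolder" and "Output" columns.
SAMPLES = ["sampleA", "sampleB"]
OUTPREFIXES = ["destA", "destB"]


def target(outf, sample):
    """Build the path pattern used in rule all for one (outf, sample) pair."""
    return os.path.join(outf, sample, "methods.txt")


# All-combinations pairing (what expand() does by default):
# every destination is combined with every sample.
all_combinations = [target(o, s) for o, s in product(OUTPREFIXES, SAMPLES)]

# Positional pairing (what passing zip to expand() asks for):
# the i-th destination is combined only with the i-th sample.
paired = [target(o, s) for o, s in zip(OUTPREFIXES, SAMPLES)]

# The mismatched target from the error message exists only in the first list.
mismatched = target("destB", "sampleA")
print(len(all_combinations), len(paired))
print(mismatched in all_combinations, mismatched in paired)
```

With two samples and two destinations, the all-combinations list has four targets, including the mismatched destB/sampleA one, while the positional list has only the two intended pairs. This assumes both lists come from the same rows of the sample sheet in the same order, as they do when both are taken from the same DataFrame.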