How To Embed Custom Python Functions and Multiple Wildcards in Snakemake Rules

I am new to Snakemake and am trying to understand it better. I went through the docs, but I still struggle with some of the concepts. Let's assume I have a Snakemake rule that takes an input file and generates some output.

I prepared a config file that looks like this:

config.yml:

mydir: "/usr/home/data/"

My tabular data looks like this:

sample	file
sampleA	file1.fastq
sampleB	file2.fastq
sampleC	file3.fastq

I read the tabular data into the snakefile:

configfile: "config.yml"

import pandas as pd
table = pd.read_table("samples.tsv").set_index("sample", drop=False)

Then I have a rule all at the top of my snakefile which collects the output from my dummy rule:

rule all:
    input:
        expand("output/{sample}_output.txt", sample=table["file"])

rule dummy:
    input: config["mydir"] + "{sample}"
    output: "output/{sample}_output.txt"
    shell: "sometool {input} {output}"

If I have two sub-directories, A and B, under the data folder, how can I check whether a file listed in my tsv file (e.g. file1.fastq) is in folder A or folder B, and then use whichever path exists as input in the rule? Also, how can I name my output after the sample for the respective file? For example, instead of the output being named file1.fastq_output.txt, I want it to be named after the respective sample, sampleA_output.txt, while still using "file1.fastq" as input to the shell command.

My idea was to check if the file is in folder A or B with a custom python function outside of the snakemake rule with something like this:

def get_file(file):
    test = config["mydir"] + file
    if os.path.exists(test) == True:
        return test

and then use this custom function as input for rule dummy:

rule dummy:
    input: get_file({sample})
    output: "output/{sample}_output.txt"
    shell: "sometool {input} {output}"

For renaming the output files based on the sample column, I thought of using loc, since the rows are indexed by the sample column and can be accessed as table.loc["rowname"]["column"]. However, the get_file function does not work properly as input for my dummy rule. I am not sure why this happens, or how I can name my output after the sample while still using the file column in my shell command.

Solution

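One issue I see in your code is that a function used as input, such as get_file, should take a single parameter: the wildcards object. Writing get_file({sample}) calls the function as plain Python when the Snakefile is parsed, before any wildcard value exists. So you want something like: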

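def get_file(wc):
    # The table is indexed by sample name, so look up the row by label.
    # (table.sample would be the DataFrame.sample method, not the column.)
    file = table.loc[wc.sample, "file"]
    ## Possibly some more logic to return the input file.
    return file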

Then use the function as:

rule dummy:   
    input: get_file,
    output: "output/{sample}_output.txt"   
    shell: "sometool {input} {output}"

A bit more about get_file: it should return a string or a list of strings naming the file(s) you want to use as input. For example, it may return "file1.fastq" or ["path/to/file1.fastq", "path/to/file2.fastq"].
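The question also asks how to pick the input from one of two sub-directories. Here is a minimal sketch of that "more logic" step, assuming the sub-directories are literally named A and B under config["mydir"]. The directory names, the helper name resolve_input and the choice to raise an error are illustrative assumptions, not something stated in the question:

```python
import os


def resolve_input(base_dir, filename, subdirs=("A", "B")):
    """Return the first existing path base_dir/<subdir>/filename.

    The sub-directory names are assumptions made for illustration.
    """
    for subdir in subdirs:
        candidate = os.path.join(base_dir, subdir, filename)
        if os.path.exists(candidate):
            return candidate
    # Raising an explicit error makes a missing file easy to spot;
    # a function that silently returns None is harder to debug.
    raise FileNotFoundError(
        f"{filename} not found in any of {subdirs} under {base_dir}"
    )
```

Inside get_file, the last line could then be something like return resolve_input(config["mydir"], file). Keep in mind that input functions are evaluated while Snakemake builds the job graph, so this only works for files that already exist on disk at that point.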

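To get the correct file(s) for a given value of the sample wildcard, query the sample table (table) for the corresponding value in the file column. How you implement that is up to you; it is more a Python and pandas question than a Snakemake one. Note that for the outputs to be named after the samples, rule all also has to expand over the sample names (sample=table["sample"]) instead of the file names.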
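For the pandas side, here is a small sketch of the lookup, using an in-memory copy of the sample sheet from the question so that it runs on its own. The helper name lookup_file is made up for illustration. One thing to watch out for: DataFrame has a method called sample, so table.sample refers to that method rather than to the sample column; use table["sample"] or the index instead.

```python
import io

import pandas as pd

# In-memory stand-in for samples.tsv from the question.
TSV = (
    "sample\tfile\n"
    "sampleA\tfile1.fastq\n"
    "sampleB\tfile2.fastq\n"
    "sampleC\tfile3.fastq\n"
)
table = pd.read_table(io.StringIO(TSV)).set_index("sample", drop=False)


def lookup_file(table, sample):
    """Return the file name registered for one sample (hypothetical helper)."""
    # The index holds the sample names, so .loc can select by label.
    return table.loc[sample, "file"]


# Attribute access does not reach the column here: `sample` is a method.
assert callable(table.sample)

print(lookup_file(table, "sampleA"))  # file1.fastq
```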

What if you want get_file to take additional parameters, e.g. to pass table and config explicitly? Then you can do:

def get_file(wc, table, config, stuff):
    # Look up the row by its index label (the sample name).
    file = table.loc[wc.sample, "file"]
    ## More work with config, stuff etc...
    return file

and use it as:

rule one:
    input:
        fin=lambda wc: get_file(wc, my_table, my_config, my_stuff),
    ...
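To see why the lambda works, here is the same pattern outside Snakemake: the lambda takes only the wildcards object and closes over the extra arguments. In this sketch a SimpleNamespace stands in for the wildcards object and a plain dict stands in for the table; both, along with the contents of my_table and my_config, are stand-ins for illustration only.

```python
from types import SimpleNamespace


def get_file(wc, table, config):
    """Build the input path for one sample from explicitly passed objects."""
    return config["mydir"] + table[wc.sample]


# Stand-ins for the real objects (assumptions for this sketch).
my_table = {"sampleA": "file1.fastq", "sampleB": "file2.fastq"}
my_config = {"mydir": "/usr/home/data/"}

# A one-argument callable: the shape expected of an input function.
input_func = lambda wc: get_file(wc, my_table, my_config)

wc = SimpleNamespace(sample="sampleA")  # mimics wildcards.sample
print(input_func(wc))  # /usr/home/data/file1.fastq
```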