I am new to Snakemake and am trying to understand it better. I went through the docs, but I still struggle with some of the concepts. Let's assume I have a Snakemake rule that takes an input file and generates some output.
I prepared a config file that looks like this:
config.yml:

    mydir: "/usr/home/data/"

My tabular data looks like this:

| sample | file |
|---|---|
| sampleA | file1.fastq |
| sampleB | file2.fastq |
| sampleC | file3.fastq |

I read the tabular data into the snakefile:

    configfile: "config.yml"

    import pandas as pd
    table = pd.read_table("samples.tsv").set_index("sample", drop=False)

Then I have a rule all at the top of my snakefile which collects the output from my dummy rule:

    rule all:
        input:
            expand("output/{sample}_output.txt", sample=table["file"])

    rule dummy:
        input: config["mydir"] + "{sample}"
        output: "output/{sample}_output.txt"
        shell: "sometool {input} {output}"

How can I check whether a file listed in my TSV file, e.g. file1.fastq, is in folder A or folder B, given that I have two sub-directories under the data folder, and then use whichever path exists as input to the rule? Also, how can I name my output after the sample for the respective file? For example, instead of the output file1.fastq_output.txt, I want the output to be named after the respective sample, sampleA_output.txt, while still using "file1.fastq" as input to the shell command.
My idea was to check if the file is in folder A or B with a custom python function outside of the snakemake rule with something like this:

    def get_file(file):
        test = config["mydir"] + file
        if os.path.exists(test) == True:
            return test

and then use this custom function as input for rule dummy

    rule dummy:
        input: get_file({sample})
        output: "output/{sample}_output.txt"
        shell: "sometool {input} {output}"

For renaming the output files based on the sample column, I thought of using the loc function, since the rows are indexed by the sample column and can be accessed as table.loc["rowname"]["column"]. However, the get_file function does not work properly as input for my dummy rule. I am not sure why this happens. How can I name my output after the sample while still using the file column in my shell command?
One issue I see in your code is that a function used as input, like get_file, should take a single parameter: the wildcards object. So something like:

    def get_file(wc):
        # Use table["sample"]: table.sample is the DataFrame.sample method, not the column.
        file = table[table["sample"] == wc.sample]["file"].iloc[0]
        ## Possibly some more logic to return the input file.
        return file

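A note on the lookup itself: with attribute access, `table.sample` resolves to pandas' `DataFrame.sample` method rather than the `sample` column, so bracket access (`table["sample"]`) or `.loc` on the index is the safer way to filter. A minimal sketch, assuming the table from the question is rebuilt inline instead of read from samples.tsv:

```python
import pandas as pd

# Same layout as the samples table in the question (built inline for the demo).
table = pd.DataFrame(
    {
        "sample": ["sampleA", "sampleB", "sampleC"],
        "file": ["file1.fastq", "file2.fastq", "file3.fastq"],
    }
).set_index("sample", drop=False)

# Attribute access gives the DataFrame.sample method, not the column.
attribute_is_method = callable(table.sample)

# Bracket access gives the column, so the comparison yields a boolean mask.
by_mask = table[table["sample"] == "sampleA"]["file"].iloc[0]

# Because the index holds the sample names, .loc is a shorter equivalent.
by_index = table.loc["sampleA", "file"]
```

Both lookups should give the same file name for a given sample; which one to use is a matter of taste.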
Then use the function as:

    rule dummy:
        input: get_file,
        output: "output/{sample}_output.txt"
        shell: "sometool {input} {output}"

More explanation about the get_file function: it should return a string or a list of strings corresponding to the file(s) you want to use as input. E.g. it may return "file1.fastq" or ["path/to/file1.fastq", "path/to/file2.fastq"].
To get the correct file(s) for a given value of the wildcard sample, you can query the sample table (table) to get the value of the corresponding file. How you implement that is up to you, and it's more a Python and pandas question than a Snakemake one.
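For the folder A / folder B part of the question, one possible implementation is sketched below. It is only a sketch under assumptions: the sub-directory names "A" and "B", the helper name `find_input`, and the temporary directory standing in for config["mydir"] are all made up for illustration.

```python
import os
import tempfile

import pandas as pd

table = pd.DataFrame(
    {"sample": ["sampleA", "sampleB"], "file": ["file1.fastq", "file2.fastq"]}
).set_index("sample", drop=False)


def find_input(sample, table, base_dir, subdirs=("A", "B")):
    """Return the first existing path for the sample's file under base_dir."""
    filename = table.loc[sample, "file"]
    for subdir in subdirs:
        candidate = os.path.join(base_dir, subdir, filename)
        if os.path.exists(candidate):
            return candidate
    # Fail loudly instead of silently returning None.
    raise FileNotFoundError(f"{filename} not found in {subdirs} under {base_dir}")


# Demo: a throwaway directory tree standing in for config["mydir"].
base_dir = tempfile.mkdtemp()
for subdir, filename in [("A", "file1.fastq"), ("B", "file2.fastq")]:
    os.makedirs(os.path.join(base_dir, subdir), exist_ok=True)
    open(os.path.join(base_dir, subdir, filename), "w").close()

found_a = find_input("sampleA", table, base_dir)
found_b = find_input("sampleB", table, base_dir)

# Targets for rule all are built from the sample names, not the file names.
targets = [f"output/{s}_output.txt" for s in table["sample"]]
```

In the Snakefile, get_file(wc) could then return find_input(wc.sample, table, config["mydir"]), and the expand() in rule all would take sample=table["sample"], so that the {sample} wildcard (and therefore the output name) is the sample name rather than the file name.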
What if you want get_file to take additional parameters? E.g. explicitly pass table and config as parameters? Then you can do:

    def get_file(wc, table, config, stuff):
        # Use table["sample"]: table.sample is the DataFrame.sample method, not the column.
        file = table[table["sample"] == wc.sample]["file"].iloc[0]
        ## More work with config, stuff etc...
        return file

and use it as:

    rule one:
        input:
            fin=lambda wc: get_file(wc, my_table, my_config, my_stuff),
        ...
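
The lambda pattern above can be tried outside Snakemake. In this sketch a `SimpleNamespace` stands in for the wildcards object that Snakemake would pass, and the `stuff` parameter is dropped for brevity; the table and config values are rebuilt inline and are assumptions for the demo only.

```python
from types import SimpleNamespace

import pandas as pd

my_table = pd.DataFrame(
    {"sample": ["sampleA", "sampleB"], "file": ["file1.fastq", "file2.fastq"]}
).set_index("sample", drop=False)
my_config = {"mydir": "/usr/home/data/"}


def get_file(wc, table, config):
    # Look up the file for this sample and prefix it with the data directory.
    return config["mydir"] + table.loc[wc.sample, "file"]


# The lambda takes only the wildcards; the other arguments are captured.
input_func = lambda wc: get_file(wc, my_table, my_config)

# Stand-in for the wildcards object Snakemake would pass to the function.
fake_wc = SimpleNamespace(sample="sampleB")
resolved = input_func(fake_wc)
```

The point of the wrapper is that the rule only ever sees a one-argument callable, while get_file stays free of globals and is easy to test on its own.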