Sorry if this duplicates other questions, but I couldn't work out how to debug what's going on in my case. I have a dataframe like this:
Sample gender phenotype subject_id
0 ERR35175 male tumor 13
1 ERR35176 male control 13
2 ERR35177 female tumor 14
3 ERR35178 female control 14
4 ERR35179 male tumor 16
5 ERR35180 male control 16
Given a subject_id, I need to pass both the tumor and the control sample from the dataframe as input, concatenating each with a path from the config and a file suffix, to produce an output named after the subject_id. To do so, I've written this rule (in the file snv_calling.smk):
rule Mutect2:
    input:
        tumor=lambda wc: config["datadirs"]["BQSR_2"]+'/%s_recal.pass2.bam' %df[(df.phenotype == "tumor") & (df.subject_id == [wc.patient])].Sample.values[0],
        normal=lambda wc: config["datadirs"]["BQSR_2"]+'/%s_recal.pass2.bam' %df[(df.phenotype == "control") & (df.subject_id == [wc.patient])].Sample.values[0]
    output:
        vcf=config["datadirs"]["VCF"]+"/{patient}.vcf"
    shell:
        """
        gatk Mutect2 \
        -I {input.tumor} \
        -I {input.normal} \
        -O {output.vcf}
        """
Inside the Snakefile:
PATIENT=['13','14','16']
rule all:
    input:
        expand(config["datadirs"]["VCF"]+"/"+"{patient}.vcf", patient=PATIENT)
It gives me this error, where line 37 is the first input argument:
InputFunctionException in line 37 of ../rules/snv_calling.smk:
Error:
ValueError: ('Lengths must match to compare', (0,), (1,))
Wildcards:
patient=13
Traceback:
File "../rules/snv_calling.smk", line 45, in <lambda>
I'm struggling to understand what's going on, because judging from the error the patient wildcard seems to be assigned properly. If I run the function outside of Snakemake against the PATIENT list, there is no error.
The parameters are stored in a dataframe, and Snakemake has a handy utility for working with tabulated parameters, Paramspace. Below is a rough take on your specific case; it will need some adjustments for command syntax and paths.
The first step is to reshape the data so that each subject occupies a single row, which makes the workflow easier:
from io import StringIO
import pandas as pd
data = StringIO(
"""index Sample gender phenotype subject_id
0 ERR35175 male tumor 13
1 ERR35176 male control 13
2 ERR35177 female tumor 14
3 ERR35178 female control 14
4 ERR35179 male tumor 16
5 ERR35180 male control 16"""
)
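# Raw string avoids the invalid-escape warning for "\s" on recent Python versions.
df = pd.read_csv(data, sep=r"\s+")
# One row per subject: the tumor and control sample IDs become columns.
df = df.pivot(
    index=["subject_id", "gender"], values="Sample", columns="phenotype"
).reset_index()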
#phenotype subject_id gender control tumor
#0 13 male ERR35176 ERR35175
#1 14 female ERR35178 ERR35177
#2 16 male ERR35180 ERR35179
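With one row per subject, finding the tumor and control samples for a given subject becomes a single-row lookup. Here is a minimal sketch of that idea, independent of Paramspace. The helper name samples_for is made up for illustration, and the cast to int assumes that subject_id was parsed as an integer column while wildcard values arrive as strings:

```python
from io import StringIO

import pandas as pd

# Same table as in the question, rebuilt here so the sketch runs on its own.
raw = StringIO(
    """Sample gender phenotype subject_id
ERR35175 male tumor 13
ERR35176 male control 13
ERR35177 female tumor 14
ERR35178 female control 14
ERR35179 male tumor 16
ERR35180 male control 16"""
)
samples = pd.read_csv(raw, sep=r"\s+")

# One row per subject, with the tumor and control sample IDs as columns.
wide = samples.pivot(index="subject_id", columns="phenotype", values="Sample")


def samples_for(patient):
    """Hypothetical helper: return (tumor, control) sample IDs for one subject."""
    # Wildcard values are strings, while subject_id was parsed as an integer.
    row = wide.loc[int(patient)]
    return row["tumor"], row["control"]


print(samples_for("13"))
```

Comparing against a scalar of the right type, rather than against a one-element list, is what keeps pandas from attempting an element-wise comparison between sequences of different lengths.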
Now, create the parameter space:
from snakemake.utils import Paramspace
paramspace = Paramspace(df, filename_params='*')
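To get a feel for what the parameter space generates, here is a plain-Python imitation of the naming scheme. It is not Snakemake's implementation. It assumes that, with filename_params='*', every column is encoded into the file name as name~value pairs joined by underscores, in dataframe column order, which is my reading of the Snakemake documentation; check the actual strings in your Snakemake version:

```python
# Illustration only: this mimics the naming scheme, it is not Snakemake code.
columns = ["subject_id", "gender", "control", "tumor"]
rows = [
    (13, "male", "ERR35176", "ERR35175"),
    (14, "female", "ERR35178", "ERR35177"),
    (16, "male", "ERR35180", "ERR35179"),
]


def wildcard_pattern(columns):
    """Build a 'name~{name}' template with one wildcard per column."""
    return "_".join(f"{name}~{{{name}}}" for name in columns)


def instance_patterns(columns, rows):
    """Fill the template once per row of the table."""
    template = wildcard_pattern(columns)
    return [template.format(**dict(zip(columns, row))) for row in rows]


print(wildcard_pattern(columns))
for target in instance_patterns(columns, rows):
    print(target)
```

Under that assumption, each row of the table maps to exactly one target file, and every column value can be recovered from the file name as a wildcard.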
Finally, modify the rules to use the parameter space:
rule all:
    input:
        paramspace.instance_patterns
rule Mutect2:
    output:
        done=touch(paramspace.wildcard_pattern),
    params:
        parameters=paramspace.instance,
    shell:
        """
        echo {params.parameters[gender]}
        echo {params.parameters[tumor]}
        echo {params.parameters[control]}
        """
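Because the output pattern carries one wildcard per column, the tumor and control sample IDs are also available as wildcards, so the BAM paths from the question could be built by an input function. This is a rough sketch, not tested inside Snakemake: bam_for is a hypothetical helper, the directory results/bqsr2 is a made-up stand-in for config["datadirs"]["BQSR_2"], and SimpleNamespace only imitates the wildcards object that Snakemake passes in:

```python
from types import SimpleNamespace

# Stand-in for the workflow config; the directory name is made up.
config = {"datadirs": {"BQSR_2": "results/bqsr2"}}


def bam_for(column):
    """Return an input function that builds the BAM path from one wildcard."""

    def _input(wildcards):
        # The wildcard named after the column holds the sample ID.
        sample = getattr(wildcards, column)
        return f"{config['datadirs']['BQSR_2']}/{sample}_recal.pass2.bam"

    return _input


# Imitation of the wildcards Snakemake would pass for subject 13.
wc = SimpleNamespace(
    subject_id="13", gender="male", control="ERR35176", tumor="ERR35175"
)
print(bam_for("tumor")(wc))
print(bam_for("control")(wc))
```

In the rule itself this would read something like tumor=bam_for("tumor") and normal=bam_for("control") in the input section, with the echo lines replaced by the gatk Mutect2 command.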
Update:
It's possible to adapt it to work with intermediate outputs. The parameter space acts as a pandas dataframe, so it's possible to select columns of interest:
rule all:
    input:
        paramspace.instance_patterns,
rule some_rule:
    output:
        test=paramspace[["gender"]].wildcard_pattern,
rule Mutect2:
    input:
        test=paramspace[["gender"]].wildcard_pattern,
    output:
        done=touch(paramspace.wildcard_pattern),
    params:
        parameters=paramspace.instance,
    shell:
        """
        echo {params.parameters[gender]}
        echo {params.parameters[tumor]}
        echo {params.parameters[control]}
        """