Tags: python, pandas, parameters, snakemake, directed-acyclic-graphs

Snakemake input rule definition via lambda + Pandas dataframe


Sorry if this duplicates other questions, but I couldn't work out how to debug what's going on in my case. I have a dataframe like this:

       Sample  gender phenotype subject_id
0  ERR35175    male     tumor         13
1  ERR35176    male   control         13
2  ERR35177  female     tumor         14
3  ERR35178  female   control         14
4  ERR35179    male     tumor         16
5  ERR35180    male   control         16

Given a subject_id, I need to pass both the tumor and the control sample from the dataframe as input, building each path from the config directory, the sample name, and the file suffix, to produce an output named after the subject_id. To do so, I've written this rule (in the file snv_calling.smk):

rule Mutect2:
    input:
        tumor=lambda wc: config["datadirs"]["BQSR_2"]+'/%s_recal.pass2.bam' %df[(df.phenotype == "tumor") & (df.subject_id == [wc.patient])].Sample.values[0],
        normal=lambda wc: config["datadirs"]["BQSR_2"]+'/%s_recal.pass2.bam' %df[(df.phenotype == "control") & (df.subject_id == [wc.patient])].Sample.values[0]
    output:
        vcf=config["datadirs"]["VCF"]+"/{patient}.vcf"
    shell:
        """
        gatk Mutect2 \
        -I {input.tumor} \
        -I {input.normal} \
        -O {output.vcf}
        """

Inside the Snakefile:

PATIENT=['13','14','16']
rule all:
    input:
        expand(config["datadirs"]["VCF"]+"/"+"{patient}.vcf", patient=PATIENT)

It gives me this error, where line 37 is the first input argument:

InputFunctionException in line 37 of ../rules/snv_calling.smk:
Error:
  ValueError: ('Lengths must match to compare', (0,), (1,))
Wildcards:
  patient=13
Traceback:
  File "../rules/snv_calling.smk", line 45, in <lambda>

I'm struggling to understand what's going on, because the error shows that the patient wildcard is assigned properly. If I run the same function outside of Snakemake against the PATIENT list, there is no error.


Solution

  • The parameters are stored in a dataframe, and Snakemake has a handy utility for working with tabulated parameters: Paramspace. Below is a rough take on your specific case; it will need some adjustments for command syntax and paths.

    The first step is to reshape the data so that each subject occupies a single row:

    from io import StringIO
    
    import pandas as pd
    
    data = StringIO(
        """index Sample  gender phenotype subject_id
    0  ERR35175    male     tumor         13
    1  ERR35176    male   control         13
    2  ERR35177  female     tumor         14
    3  ERR35178  female   control         14
    4  ERR35179    male     tumor         16
    5  ERR35180    male   control         16"""
    )
    
    df = pd.read_csv(data, sep=r"\s+")  # raw string avoids an invalid-escape warning
    
    df = df.pivot(
        index=["subject_id", "gender"], values="Sample", columns="phenotype"
    ).reset_index()
    
    #phenotype  subject_id  gender   control     tumor
    #0                  13    male  ERR35176  ERR35175
    #1                  14  female  ERR35178  ERR35177
    #2                  16    male  ERR35180  ERR35179
    

    Now, create the parameter space:

    from snakemake.utils import Paramspace
    paramspace = Paramspace(df, filename_params='*')
    

    Finally, modify the rules to use the parameter space:

    rule all:
        input:
            paramspace.instance_patterns
    
    rule Mutect2:
        output:
            done=touch(paramspace.wildcard_pattern),
        params:
            parameters=paramspace.instance,
        shell:
            """
            echo {params.parameters[gender]}
            echo {params.parameters[tumor]}
            echo {params.parameters[control]}
            """
    

    Update:

    This can be adapted to work with intermediate outputs. The parameter space behaves like a pandas dataframe, so you can select the columns of interest:

    rule all:
        input:
            paramspace.instance_patterns,
    
    
    rule some_rule:
        output:
            test=paramspace[["gender"]].wildcard_pattern,
    
    
    rule Mutect2:
        input:
            test=paramspace[["gender"]].wildcard_pattern,
        output:
            done=touch(paramspace.wildcard_pattern),
        params:
            parameters=paramspace.instance,
        shell:
            """
            echo {params.parameters[gender]}
            echo {params.parameters[tumor]}
            echo {params.parameters[control]}
            """
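
If you would rather keep the lambda-based input functions from the question, the row lookup itself can be sketched in plain pandas. This is only a sketch under two assumptions: that `subject_id` is read as an integer column, and that a small helper (here called `sample_for`, a made-up name) does the lookup. Snakemake passes wildcard values as strings, so the value is cast before comparing, and the comparison is made against a scalar rather than a one-element list. Comparing a Series with a list of a different length is the kind of operation that raises `Lengths must match to compare`.

```python
import pandas as pd

# Same table as in the question, in long format (one row per sample).
df = pd.DataFrame(
    {
        "Sample": ["ERR35175", "ERR35176", "ERR35177",
                   "ERR35178", "ERR35179", "ERR35180"],
        "gender": ["male", "male", "female", "female", "male", "male"],
        "phenotype": ["tumor", "control"] * 3,
        "subject_id": [13, 13, 14, 14, 16, 16],  # assumed to be integers
    }
)


def sample_for(df, patient, phenotype):
    """Return the Sample name for one patient and phenotype (hypothetical helper)."""
    # Wildcards arrive as strings; cast to match the integer subject_id column.
    # Compare against a scalar, not a list such as [patient].
    mask = (df["phenotype"] == phenotype) & (df["subject_id"] == int(patient))
    matches = df.loc[mask, "Sample"]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one {phenotype} sample for patient {patient}, "
            f"found {len(matches)}"
        )
    return matches.iloc[0]


print(sample_for(df, "13", "tumor"))    # ERR35175
print(sample_for(df, "13", "control"))  # ERR35176
```

Inside the rule, an input function could then read `tumor=lambda wc: config["datadirs"]["BQSR_2"] + "/%s_recal.pass2.bam" % sample_for(df, wc.patient, "tumor")`, with the same pattern for `normal` and `"control"`. Raising on zero or multiple matches makes a missing or duplicated sample fail with a readable message instead of an index error.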