Search code examples
pythonwildcardsnakemake

Single output file per group for grouped samples using Snakemake


I'm trying to write a snakemake pipeline to process samples that are divided into groups which requires a single summary file per group. For examples, samples are divided as follows:

Group 1:
    Sample 1
    Sample 2
Group 2: 
    Sample 3 
    Sample 4

Each sample is process using bedtools to generate an output file per sample. Next, I need to summarize at the group level for each Group's collection of samples. An abbreviated snakemake file looks like this:

 rule intersect:
     input:
         bam = join('{group}','{sample}.bam'),
         reg_bed = join('{group}', 'region.bed')
     output:
         reg_intersect = join('{group}', '{sample}.intersect.bed')
    shell:
        'bedtools intersect -wa -wb -a {input.reg_bed} -b {input.bam} > {output.reg_intersect}'

rule aggregate:
    input:
        rules.interesect.output
    output:
        join('{group}','summary_stats.csv')
    shell:
        touch(join('{group}','summary_stats.csv'))
        #this would be a call to a python function that operates on all the output files to generate a single file

I get complaints that wildcards not agreeing between input and output (input contains {group} and {sample} while output only has {group}. I've tried using expand() but I have to include values for both group and sample, and sample isn't possible because it is dependent upon group.

Any suggestions are welcome.


Solution

  • Like @Maarten-vd-Sande said, the best is to use an input function to get all samples for a group in your aggregate rule.

    samplesInGroups = {"group1":["sample1","sample2"],"group2":["sample3","sample4"]}
    
    def getSamplesInGroup(wildcards):
        return [join(wildcards.group,s+".intersect.bed") for s in samplesInGroups[wildcards.group]]
    
    rule all:
         input:  expand("{group}/summary_stats.csv",group=list(samplesInGroups.keys()))
    
    rule intersect:
         input:
             bam = join('{group}','{sample}.bam'),
             reg_bed = join('{group}', 'region.bed')
         output:
             reg_intersect = join('{group}', '{sample}.intersect.bed')
         shell:
            'bedtools intersect -wa -wb -a {input.reg_bed} -b {input.bam} > {output.reg_intersect}'
    
    rule aggregate:
         input:
            getSamplesInGroup
         output:
            join('{group}','summary_stats.csv')
         shell:
            "python ... {input}"