Search code examples
wildcardsnakemake

How to select all files from one sample?


I have a problem figuring out how to make the input directive only select all {samples} files in the rule below.

rule MarkDup:
    input:
        expand("Outputs/MergeBamAlignment/{samples}_{lanes}_{flowcells}.merged.bam", zip,
            samples=samples['sample'],
            lanes=samples['lane'],
            flowcells=samples['flowcell']),
    output:
        bam = "Outputs/MarkDuplicates/{samples}_markedDuplicates.bam",
        metrics = "Outputs/MarkDuplicates/{samples}_markedDuplicates.metrics",
    shell:
        "gatk --java-options -Djava.io.tempdir=`pwd`/tmp \
        MarkDuplicates \
        $(echo ' {input}' | sed 's/ / --INPUT /g') \
        -O {output.bam} \
        --VALIDATION_STRINGENCY LENIENT \
        --METRICS_FILE {output.metrics} \
        --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 200000 \
        --CREATE_INDEX true \
        --TMP_DIR Outputs/MarkDuplicates/tmp"

Currently it will create correctly named output files, but it selects all files that match the pattern based on all wildcards. So I'm perhaps halfway there. I tried changing {samples} to {{samples}} in the input directive as such:

expand("Outputs/MergeBamAlignment/{{samples}}_{lanes}_{flowcells}.merged.bam", zip,
            lanes=samples['lane'],
            flowcells=samples['flowcell']),`

but this broke the previous rule somehow. So the solution is something like

input:
     "{sample}_*.bam"

But clearly this doesn't work. Is it possible to collect all files that match {sample}_*.bam with a function and use that as input? And if so, will the function still work with $(echo ' {input}' etc...) in the shell directive?


Solution

  • If you just want all the files in the directory, you can use a lambda function

    from glob import glob
    
    rule MarkDup:
        input:
            lambda wcs: glob('Outputs/MergeBamAlignment/%s*.bam' % wcs.samples)
        output:
            bam="Outputs/MarkDuplicates/{samples}_markedDuplicates.bam",
            metrics="Outputs/MarkDuplicates/{samples}_markedDuplicates.metrics"
        shell:
            ...
    

    Just be aware that this approach can't do any checking for missing files, since it will always report that the files needed are the files that are present. If you do need confirmation that the upstream rule has been executed, you can have the previous rule touch a flag, which you then require as input to this rule (though you don't actually use the file for anything other than enforcing execution order).