Snakemake Rule to simplify file names using wildcards


Input Data

.
├── barcode01
│   └── fastq_runid_6292747b0109c4fa5918c50eb8204bb715f19ad0_0.fastq
├── barcode02
│   └── fastq_runid_6292747b0109c4fa5918c50eb8204bb715f19ad0_0.fastq
├── barcode03
│   └── fastq_runid_6292747b0109c4fa5918c50eb8204bb715f19ad0_0.fastq
├── barcode04
│   └── fastq_runid_6292747b0109c4fa5918c50eb8204bb715f19ad0_0.fastq

Snakemake rule

rule symlink_results_demultiplex:
    input:
        inputdirectory+"/basecall/demultiplex/{sample_demultiplex}/{sample_runid}.fastq"
    output:
        outdirectory+"/mothur/{sample_demultiplex}.fastq"
    threads: 1
    shell:
        "ln -s {input} {output}"

However, this fails because the input contains a wildcard ({sample_runid}) that does not appear in the output, and Snakemake can only determine wildcard values from the output file. I would like to create a symlink named just barcode01.fastq, dropping the redundant "fastq_runid_6292747b0109c4fa5918c50eb8204bb715f19ad0_0" part of the file name.

What would be the best way to do this?


Solution

  • One option is to look up the input file name in an input function that depends only on the {sample_demultiplex} wildcard. The example code below might work, depending on exactly how your folders are set up. It assumes that each {sample_demultiplex} wildcard only ever corresponds to a single fastq file.

    import os
    import glob
    
    def get_symlink_results_demultiplex_input(wildcards):
        # Path components must not start with "/": os.path.join discards
        # everything before an absolute component, which would drop inputdirectory.
        fastq_dir = os.path.join(inputdirectory, "basecall", "demultiplex", wildcards.sample_demultiplex)
        # This assumes there is only ever one fastq file in each barcode directory.
        fastq_files = sorted(glob.glob(os.path.join(fastq_dir, "*.fastq")))
        return fastq_files[0]
        
    
    rule symlink_results_demultiplex:
        input:
            get_symlink_results_demultiplex_input
        output:
            outdirectory+"/mothur/{sample_demultiplex}.fastq"
        threads: 1
        shell:
            "ln -s {input} {output}"
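
    If you want the lookup to fail loudly when a barcode directory holds zero or several fastq files, the input function could delegate to a stricter helper. The following is only a sketch: the name find_single_fastq and its demultiplex_dir parameter are hypothetical, not part of Snakemake. It returns an absolute path, because `ln -s` stores the target string as given, so a relative target would be resolved relative to the directory containing the symlink.

```python
import glob
import os


def find_single_fastq(demultiplex_dir, barcode):
    """Return the absolute path of the only .fastq file in demultiplex_dir/barcode.

    Hypothetical helper: raises ValueError unless exactly one file matches.
    """
    # Escape the directory part so characters such as "[" are matched literally.
    barcode_dir = glob.escape(os.path.join(demultiplex_dir, barcode))
    matches = sorted(glob.glob(os.path.join(barcode_dir, "*.fastq")))
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one .fastq file for {barcode!r}, "
            f"found {len(matches)}: {matches}"
        )
    return os.path.abspath(matches[0])
```

    Inside the Snakefile, get_symlink_results_demultiplex_input could then return find_single_fastq(os.path.join(inputdirectory, "basecall", "demultiplex"), wildcards.sample_demultiplex), assuming the directory layout from the question.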