Search code examples
bioinformaticssnakemake

unknown output in snakemake


I'm working on implementing a very simple pipeline in snakemake in hopes of replacing a chain of annoying bash scripts with one cohesive Snakefile.

I'm having trouble writing a rule that splits a file into smaller pieces (using GNU split), and then leads to a second rule where the output is concatenated together.

I don't know what to write for the input in the concat step, since I don't know how to define all the files fitting the pattern bam_files/test*. I tried with glob, but that decidedly doesn't seem to work (it seems like it's actually skipping split altogether with the glob included). Is there any better way that I could be doing this?

# test snakemake pipeline
import glob


SAMPLE_IDS = ["test"]

rule all: 
    input: 
        expand("bam_files/{FASTQ}.out", FASTQ=SAMPLE_IDS)


rule split: 
    input: 
        expand("{FASTQ}.txt", FASTQ=SAMPLE_IDS)
    output: 
        "bam_files/{FASTQ}."
    shell:
        "cat {input} | split -l 1000 -d - {output}."


rule concat: 
    input:
        split_files = glob.glob("bam_files/{FASTQ}.*")
    output: 
        "bam_files/{FASTQ}.out"
    shell: 
        "cat {input} > {output}"

Solution

  • I think this should work:

    SAMPLE_IDS = ["test"]
    
    rule all: 
        input: 
            expand("bam_files/{FASTQ}.out", FASTQ=SAMPLE_IDS)
    
    
    rule split: 
        input: 
            "{FASTQ}.txt"
        output: 
            dynamic("bam_files/{FASTQ}.{PART}")
        params:
            length=1000
        shell:
            "cat {input} | split -l {params.length} -d - bam_files/{FASTQ}."
    
    
    rule concat: 
        input:
            split_files = dynamic("bam_files/{FASTQ}.{PART}")
        output: 
            "bam_files/{FASTQ}.out"
        shell: 
            "cat {input} > {output}"
    

    It looks like the split rule should be taking one file {FASTQ}.txt at a time and producing {FASTQ}.1, {FASTQ}.2, ... or something similar. Because you don't know ahead of time how many files it will produce, you need to use dynamic() for both split.output and concat.input.