I'm working on implementing a very simple pipeline in snakemake in hopes of replacing a chain of annoying bash scripts with one cohesive Snakefile.
I'm having trouble writing a rule that splits a file into smaller pieces (using GNU split), and then leads to a second rule where the output is concatenated together.
I don't know what to write for the input of the concat step, since I don't know how to refer to all the files matching the pattern `bam_files/test*`. I tried with `glob`, but that decidedly doesn't seem to work (it seems like it's actually skipping the split step altogether when the glob is included). Is there a better way I could be doing this?
```python
# test snakemake pipeline
import glob

SAMPLE_IDS = ["test"]

rule all:
    input:
        expand("bam_files/{FASTQ}.out", FASTQ=SAMPLE_IDS)

rule split:
    input:
        expand("{FASTQ}.txt", FASTQ=SAMPLE_IDS)
    output:
        "bam_files/{FASTQ}."
    shell:
        "cat {input} | split -l 1000 -d - {output}."

rule concat:
    input:
        split_files = glob.glob("bam_files/{FASTQ}.*")
    output:
        "bam_files/{FASTQ}.out"
    shell:
        "cat {input} > {output}"
```
I think this should work:
```python
SAMPLE_IDS = ["test"]

rule all:
    input:
        expand("bam_files/{FASTQ}.out", FASTQ=SAMPLE_IDS)

rule split:
    input:
        "{FASTQ}.txt"
    output:
        dynamic("bam_files/{FASTQ}.{PART}")
    params:
        length=1000
    shell:
        # note: inside a shell command, wildcards must be referenced
        # as {wildcards.FASTQ}, not {FASTQ}
        "cat {input} | split -l {params.length} -d - bam_files/{wildcards.FASTQ}."

rule concat:
    input:
        split_files = dynamic("bam_files/{FASTQ}.{PART}")
    output:
        "bam_files/{FASTQ}.out"
    shell:
        "cat {input} > {output}"
```
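To see what the split step's shell command actually produces, here is a minimal sketch run outside Snakemake, assuming GNU split and a hypothetical 2500-line input file named `test.txt`:

```shell
# Minimal sketch, assuming GNU coreutils split and a hypothetical
# 2500-line input file named test.txt:
seq 2500 > test.txt
mkdir -p bam_files

# -d gives numeric suffixes (00, 01, ...), which is what the {PART}
# wildcard ends up matching:
split -l 1000 -d test.txt bam_files/test.

ls bam_files
# bam_files now contains: test.00  test.01  test.02
```

The number of pieces (three here) depends entirely on the input's line count, which is exactly why the rule's output cannot be listed statically.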
It looks like the `split` rule should take one file `{FASTQ}.txt` at a time and produce `{FASTQ}.1`, `{FASTQ}.2`, ... or something similar. Because you don't know ahead of time how many files it will produce, you need to use `dynamic()` for both `split.output` and `concat.input`. That is also why your `glob.glob` approach appeared to skip the split step: the glob is evaluated once, when the Snakefile is parsed, before any rule has run, so it matches nothing and `concat` ends up with no inputs that would require `split` to run.
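The concat step is then just `cat` over the pieces. One detail worth knowing: `split -d`'s numeric suffixes sort lexicographically, so concatenating the pieces in glob order reconstructs the original file. A self-contained shell sketch of the round trip, again assuming GNU split and a hypothetical 2500-line `test.txt`:

```shell
# Self-contained sketch of the split/concat round trip, assuming GNU
# split; test.txt is a hypothetical 2500-line input:
seq 2500 > test.txt
mkdir -p bam_files
split -l 1000 -d test.txt bam_files/test.

# Numeric suffixes (00, 01, 02) sort lexicographically, so glob order
# is the original line order:
cat bam_files/test.?? > bam_files/test.out

diff -q test.txt bam_files/test.out && echo "round trip OK"
```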