I have a scenario where the output filenames of a program are not consistent and hence I am having difficulty defining snakemake rules.
I am retrieving data from SRA and I do not know if the program will return 1 file or 2 files
Here are the 2 situations :
Situation 1
Retrieve SE data - this generates 1 output file named SRR5597649_1.fastq.gz
parallel-fastq-dump --sra-id SRR5597649 --threads 16 --outdir out/ --gzip --skip-technical --readids --dumpbase --split-files --clip
Situation 2
Retrieve PE data - this generates 2 output files SRR390728_1.fastq.gz SRR390728_2.fastq.gz
parallel-fastq-dump --sra-id SRR390728 --threads 16 --outdir out/ --gzip --skip-technical --readids --dumpbase --split-files --clip
I would like to know from the Snakemake community how they would address the above situation and how would one define the individual rules and also the rule:all
Note: I am adding more information to this question after trying out @dariober solution
Snakefile
shell.executable("bash")
import pandas as pd
import glob
import os
configfile: "config.json"
data_dir=os.getcwd()
units_table = pd.read_table("Samples.txt")
samples= list(units_table.Samples.unique())
rule all:
input:
expand("raw_fastq/{sample}.dump.done", sample=samples),
expand("results/salmon/{sample}_salmon_quant",sample=samples),
expand("results/logs/salmon/{sample}.salmon.log",sample=samples)
rule clean:
shell: "rm -rf .snakemake/"
include: 'rules/download_sra.smk'
include: 'rules/salmon_quant.smk'
download_sra.smk
rule download_sra:
"""
Download RNA-Seq data from SRA.
"""
output: touch("raw_fastq/{sample}.dump.done"),
params:
outdir = "raw_fastq",
threads = 16
priority:85
shell: "parallel-fastq-dump --sra-id {wildcards.sample} ... "
salmon_quant.smk
rule salmon_quant:
input:
touchfile = 'raw_fastq/{sample}.dump.done',
index = config['salmon_rat_gentrome_Index']
output:
directory("results/salmon/{sample}_salmon_quant")
log:
'results/logs/salmon/{sample}.salmon.log'
priority: 85
threads: 20
run:
fq= sorted(glob.glob(os.path.join('raw_fastq', wildcards.sample + '_[12].fastq.gz')))
if len(fq) == 1:
shell("salmon quant -i {input.index} -l A -r %s ..." % fq) # Process as SE
elif len(fq) == 2:
shell("salmon quant -i {input.index} -l A -1 %s -2 %s ...." % (fq[0], fq[1])) # Process as PE
else:
raise Exception('Wrong number of fastq files!')
Problem I am facing: When I try the solution suggested by 'dariober', there seems to be something wrong when the %s
is being replaced when the shell command runs. Like for example, it is getting replaced as
-r ['raw_fastq/SRR5597645_1.fastq.gz']
whereas it should have been replaced as
-r raw_fastq/SRR5597645_1.fastq.gz
Probably, I am getting it wrong somewhere - as shown in the code above, the %s
is causing a tagging along a square bracket and a single quote.
Also, I would like to have this wrapped with a temp
directive in the download_sra rules such that the fastq files and touchfiles can be deleted automatically after the job finishes.
Thanks in advance to all.
One solution may be to touch a file when fastq-dump has completed and use that file as input to the other rules.
In the case you posted, the name of the actual fastq files can be reconstructed from the SRR id and you can tell whether you have 1 or 2 fastq files based on the suffix _1.fastq.gz
or _2.fastq.gz
.
Here's an example:
import glob
import os
sra_id= ['SRR1234', 'SRR4567']
rule all:
input:
expand('{sra_id}.dump.done', sra_id= sra_id),
expand('{sra_id}.bam', sra_id= sra_id),
rule sra_dump:
output:
touch('{sra_id}.dump.done'),
shell:
r"""
parallel-fastq-dump --sra-id {wildcards.sra_id} ...
"""
rule align:
input:
'{sra_id}.dump.done',
output:
'{sra_id}.bam',
run:
fq= sorted(glob.glob(os.path.join(wildcards.sra_id, wildcards.sra_id + '_[12].fastq.gz')))
if len(fq) == 1:
shell("bwa mem %s ..." % fq) # Process as SE
elif len(fq) == 2:
shell("bwa mem %s %s ..." % (fq[0], fq[1])) # Process as PE
else:
raise Exception('Wrong number of fastq files!')