Search code examples
bioinformaticssnakemake

How to handle mutliple output filename conventions with Snakemake


I have a scenario where the output filenames of a program are not consistent and hence I am having difficulty defining snakemake rules.

I am retrieving data from SRA and I do not know if the program will return 1 file or 2 files

Here are the 2 situations :

Situation 1

Retrieve SE data - this generates 1 output file named SRR5597649_1.fastq.gz

parallel-fastq-dump --sra-id SRR5597649 --threads 16 --outdir out/ --gzip --skip-technical --readids --dumpbase --split-files --clip

Situation 2

Retrieve PE data - this generates 2 output files SRR390728_1.fastq.gz  SRR390728_2.fastq.gz

parallel-fastq-dump --sra-id SRR390728 --threads 16 --outdir out/ --gzip --skip-technical --readids --dumpbase --split-files --clip

I would like to know from the Snakemake community how they would address the above situation and how would one define the individual rules and also the rule:all

Note: I am adding more information to this question after trying out @dariober solution

Snakefile

shell.executable("bash")
import pandas as pd
import glob
import os

configfile: "config.json"

data_dir=os.getcwd()

units_table = pd.read_table("Samples.txt")
samples= list(units_table.Samples.unique())

rule all:
    input:
       expand("raw_fastq/{sample}.dump.done", sample=samples),
       expand("results/salmon/{sample}_salmon_quant",sample=samples),
       expand("results/logs/salmon/{sample}.salmon.log",sample=samples)

rule clean:
     shell: "rm -rf .snakemake/"

include: 'rules/download_sra.smk'
include: 'rules/salmon_quant.smk'

download_sra.smk

rule download_sra:
    """
    Download RNA-Seq data from SRA.
    """
    output: touch("raw_fastq/{sample}.dump.done"),
    params:
        outdir = "raw_fastq",
        threads = 16
    priority:85
    shell: "parallel-fastq-dump --sra-id {wildcards.sample} ... "

salmon_quant.smk

rule salmon_quant:
     input:
             touchfile = 'raw_fastq/{sample}.dump.done',
             index = config['salmon_rat_gentrome_Index']
     output:
             directory("results/salmon/{sample}_salmon_quant")
     log:
             'results/logs/salmon/{sample}.salmon.log'
     priority: 85
     threads: 20
     run:
        fq= sorted(glob.glob(os.path.join('raw_fastq',  wildcards.sample + '_[12].fastq.gz')))

        if len(fq) == 1:
            shell("salmon quant -i {input.index} -l A  -r %s  ..." % fq) # Process as SE
        elif len(fq) == 2:
            shell("salmon quant -i {input.index} -l A -1 %s -2 %s ...." % (fq[0], fq[1])) # Process as PE
        else:
            raise Exception('Wrong number of fastq files!')

Problem I am facing: When I try the solution suggested by 'dariober', there seems to be something wrong when the %s is being replaced when the shell command runs. Like for example, it is getting replaced as

-r ['raw_fastq/SRR5597645_1.fastq.gz']

whereas it should have been replaced as

-r raw_fastq/SRR5597645_1.fastq.gz

Probably, I am getting it wrong somewhere - as shown in the code above, the %s is causing a tagging along a square bracket and a single quote.

Also, I would like to have this wrapped with a temp directive in the download_sra rules such that the fastq files and touchfiles can be deleted automatically after the job finishes.

Thanks in advance to all.


Solution

  • One solution may be to touch a file when fastq-dump has completed and use that file as input to the other rules.

    In the case you posted, the name of the actual fastq files can be reconstructed from the SRR id and you can tell whether you have 1 or 2 fastq files based on the suffix _1.fastq.gz or _2.fastq.gz.

    Here's an example:

    import glob
    import os
    
    sra_id= ['SRR1234', 'SRR4567']
    
    rule all:
        input:
            expand('{sra_id}.dump.done', sra_id= sra_id),
            expand('{sra_id}.bam', sra_id= sra_id),
    
    rule sra_dump:
        output:
            touch('{sra_id}.dump.done'),
        shell:
            r"""
            parallel-fastq-dump --sra-id {wildcards.sra_id} ... 
            """
    
    rule align:
        input:
            '{sra_id}.dump.done',
        output:
            '{sra_id}.bam',
        run:
            fq= sorted(glob.glob(os.path.join(wildcards.sra_id, wildcards.sra_id + '_[12].fastq.gz')))
    
            if len(fq) == 1:
                shell("bwa mem %s ..." % fq) # Process as SE
            elif len(fq) == 2:
                shell("bwa mem %s %s ..." % (fq[0], fq[1])) # Process as PE
            else:
                raise Exception('Wrong number of fastq files!')