
Snakemake running Subworkflow but not the Rest of my workflow (goes directly to rule All)


I'm new to Snakemake and to StackOverflow, so don't hesitate to tell me if something is unclear or if you need more detail. I have written a workflow that converts .BCL Illumina Base Call files to demultiplexed .FASTQ files and then generates QC reports (FastQC files). This workflow is composed of:

  • Subworkflow "convert_bcl_to_fastq": it creates FASTQ files from the BCL files in a directory named Fastq. It must be executed before the main workflow, which is why I chose a subworkflow: my second rule depends on these FASTQ files, whose names I don't know in advance. A dummy file "convert_bcl_to_fastq.done" is created as an output so that I know when this subworkflow ran as expected.
  • Rule "generate_fastqc": it takes the FASTQ files generated by the subworkflow and creates FastQC files in a directory named FastQC.

Problem

When I run my workflow, I don't get any error, but it does not behave as expected: only the subworkflow runs, and then in the main workflow only the rule "all" is executed. My rule "generate_fastqc" is not executed at all. Where could I have gone wrong? Here is what I get:

Building DAG of jobs...
Executing subworkflow convert_bcl_to_fastq.
Building DAG of jobs...
Job counts:
        count   jobs
        1       convert_bcl_to_fastq
        1
[...]
Processing completed with 0 errors and 1 warnings.
Touching output file convert_bcl_to_fastq.done.
Finished job 0.
1 of 1 steps (100%) done
Complete log: /path/to/my/working/directory/conversion/.snakemake/log/2020-03-12T171952.799414.snakemake.log
Executing main workflow.
Using shell: /usr/bin/bash
Provided cores: 40
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       all
        1

localrule all:
    input: /path/to/my/working/directory/conversion/convert_bcl_to_fastq.done
    jobid: 0

Finished job 0.
1 of 1 steps (100%) done

And once all of my FASTQ files have been generated, if I run my workflow again, this time it executes the rule "generate_fastqc":

Building DAG of jobs...
Executing subworkflow convert_bcl_to_fastq.
Building DAG of jobs...
Nothing to be done.
Complete log: /path/to/my/working/directory/conversion/.snakemake/log/2020-03-12T174337.605716.snakemake.log
Executing main workflow.
Using shell: /usr/bin/bash
Provided cores: 40
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       all
        95      generate_fastqc
        96

I wanted my workflow to run entirely in one go, executing rule "generate_fastqc" right after the subworkflow completes, but I am actually forced to run it twice. I thought this would work since all the files needed in the second part of the workflow are generated by the subworkflow... Do you have any idea where I went wrong?


My Code

Here is my Snakefile for the main workflow:

subworkflow convert_bcl_to_fastq:
    workdir: WDIR + "conversion/"
    snakefile: WDIR + "conversion/Snakefile"

SAMPLES, = glob_wildcards(FASTQ_DIR + "{sample}_R1_001.fastq.gz")

rule all:
    input:
        convert_bcl_to_fastq("convert_bcl_to_fastq.done"),
        expand(FASTQC_DIR + "{sample}_R1_001_fastqc.html", sample=SAMPLES),
        expand(FASTQC_DIR + "{sample}_R2_001_fastqc.html", sample=SAMPLES)

rule generate_fastqc:
    output:
        FASTQC_DIR + "{sample}_R1_001_fastqc.html",
        FASTQC_DIR + "{sample}_R2_001_fastqc.html",
        temp(FASTQC_DIR + "{sample}_R1_001_fastqc.zip"),
        temp(FASTQC_DIR + "{sample}_R2_001_fastqc.zip")
    shell:
        "mkdir -p "+ FASTQC_DIR +" | " #Creates a FastQC directory if it is missing
        "fastqc --outdir "+ FASTQC_DIR +" "+ FASTQ_DIR +"{wildcards.sample}_R1_001.fastq.gz "+ FASTQ_DIR + " {wildcards.sample}_R2_001.fastq.gz &" #Generates FASTQC files for each sample at a time

Here is my Snakefile for the subworkflow "convert_bcl_to_fastq":

rule all:
    input:
        "convert_bcl_to_fastq.done"

rule convert_bcl_to_fastq:
    output:
        touch("convert_bcl_to_fastq.done")
    shell:
        "mkdir -p "+ FASTQ_DIR +" | " #Creates a Fastq directory if it is missing
        "bcl2fastq --no-lane-splitting --runfolder-dir "+ INPUT_DIR +" --output-dir "+ FASTQ_DIR #Demultiplexes and Converts BCL files to FASTQ files

Thank you in advance for your help!


Solution

  • The documentation about subworkflows currently states:

    When executing, snakemake first tries to create (or update, if necessary) 
    "test.txt" (and all other possibly mentioned dependencies) by executing the subworkflow. 
    Then the current workflow is executed.
    

    In your case, the only dependency declared is "convert_bcl_to_fastq.done", which Snakemake happily produces the first time.

    Snakemake does one-pass parsing of the Snakefile, and the main workflow has not been told to look for the sample files produced by the subworkflow. Since those files do not exist yet during the first execution, glob_wildcards() finds no samples and the expand() statements in rule all get no match. No match, no work to be done :-)
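
    To see why, note that glob_wildcards() runs when the Snakefile is parsed, i.e. before the subworkflow has produced anything. A minimal illustration, reusing the FASTQ_DIR variable from your code:

        # Evaluated at parse time: the Fastq directory is still empty on the first run
        SAMPLES, = glob_wildcards(FASTQ_DIR + "{sample}_R1_001.fastq.gz")
        print(SAMPLES)  # prints [] on the first run, so both expand() calls in rule all yield no targets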

    When you run the main workflow the second time, the FASTQ files already exist, so the expand() in rule all finds sample matches and the FastQC outputs get produced.

    Side note 1: Be glad you noticed this. With your current code, if you had made changes that required re-running the subworkflow, Snakemake would find the old "convert_bcl_to_fastq.done" and not re-execute the subworkflow.

    Side note 2: If you want Snakemake to be less 'one-pass', it has a rule keyword, checkpoint, that makes Snakemake re-evaluate what needs to be done as a consequence of a rule's execution. In your case, the checkpoint would be rule convert_bcl_to_fastq. That requires the rules to be in the same logical Snakefile (though include lets you split it across multiple files).
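
    Here is a minimal sketch of that checkpoint approach, assuming the conversion and QC rules are merged into one Snakefile and reusing your INPUT_DIR / FASTQ_DIR / FASTQC_DIR variables. The fastqc_targets input function and the {read} wildcard are my own additions, so adapt paths and options to your setup:

        checkpoint convert_bcl_to_fastq:
            output:
                directory(FASTQ_DIR)  # declare the whole Fastq directory, since the file names are unknown in advance
            shell:
                "bcl2fastq --no-lane-splitting --runfolder-dir "+ INPUT_DIR +" --output-dir "+ FASTQ_DIR

        def fastqc_targets(wildcards):
            # Re-evaluated only after the checkpoint has finished, so the FASTQ files exist by then
            checkpoints.convert_bcl_to_fastq.get(**wildcards)
            samples, = glob_wildcards(FASTQ_DIR + "{sample}_R1_001.fastq.gz")
            return expand(FASTQC_DIR + "{sample}_R{read}_001_fastqc.html",
                          sample=samples, read=["1", "2"])

        rule all:
            input:
                fastqc_targets

        rule generate_fastqc:
            input:
                FASTQ_DIR + "{sample}_R{read}_001.fastq.gz"
            output:
                FASTQC_DIR + "{sample}_R{read}_001_fastqc.html",
                temp(FASTQC_DIR + "{sample}_R{read}_001_fastqc.zip")
            shell:
                "mkdir -p "+ FASTQC_DIR +" && fastqc --outdir "+ FASTQC_DIR +" {input}"

    With this layout a single snakemake invocation runs the conversion first, then re-evaluates fastqc_targets and schedules one generate_fastqc job per sample and read.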