Search code examples
bioinformaticssnakemake

Executing checkpoint intermediate Commands


I've currently been running into some issues with snakemake running intermediate rules required by a checkpoint. After attempting to trouble shoot this issue, I believe the problem lies within the expand command in the aggregate_input function, but cannot figure out why it is behaving the way it is.

Here is the current checkpoint documentation from snakemake which I have modeled this after https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#data-dependent-conditional-execution

rule all:
    input:
    ¦   expand("string_tie_assembly/{sample}.gtf", sample=sample),
    ¦   expand("combined_fasta/{sample}.fa", sample=sample),
    ¦   "aggregated_fasta/all_fastas_combined.fa"




checkpoint clustering:
    input:
    ¦   "string_tie_assembly_merged/merged_{sample}.gtf"
    output:
    ¦   clusters = directory("split_gtf_file/{sample}")
    shell:
    ¦   """
    ¦   mkdir -p split_gtf_file/{wildcards.sample} ;

collapse_gtf_file.py -gtf {input} -o split_gtf_file/{wildcards.sample}/{wildcards.sample}
    ¦   """

rule gtf_to_fasta:
    input:
    ¦   "split_gtf_file/{sample}/{sample}_{i}.gtf"
    output:
    ¦   "lncRNA_fasta/{sample}/canidate_{sample}_{i}.fa"
    shell:
    ¦   "gffread -w {output} -g {reference} {input}"

rule rename_fasta_files:
    input:
    ¦   "lncRNA_fasta/{sample}/canidate_{sample}_{i}.fa"
    output:
    ¦   "lncRNA_fasta_renamed/{sample}/{sample}_{i}.fa"
    shell:
    ¦   "seqtk rename {input} {wildcards.sample}_{i} > {output}"

#Gather N number of output files from the GTF split
def aggregate_input(wildcards):
    checkpoint_output = checkpoints.clustering.get(**wildcards).output[0]
    x = expand("lncRNA_fasta_renamed/{sample}/{sample}_{i}.fa",
    ¦   sample=sample,
    ¦   i=glob_wildcards(os.path.join(checkpoint_output, "{i}.fa")).i)
    print(x)
    return x

#Aggregate fasta from split GTF files together
rule combine_fasta_file:
    input:
    ¦   aggregate_input
    output:
    ¦   "combined_fasta/{sample}.fa"
    shell:
        "cat {input} > {output}"


    ¦   aggregate_input
    output:
    ¦   "combined_fasta/{sample}.fa"
    shell:
    ¦   "cat {input} > {output}"

#Aggegate aggregated fasta files
def gather_files(wildcards):
    files = expand("combined_fasta/{sample}.fa", sample=sample)
    return(files)

rule aggregate_fasta_files:
    input:
    ¦   gather_files
    output:
    ¦  "aggregated_fasta/all_fastas_combined.fa"
    shell:
    ¦   "cat {input} > {output}"

The issue I keep running into is that upon running this snakemake file the combine_fasta_file rule does not run. After spending more time with this error I realized that the issue was the aggregate_input function was not expanding, and returns an empty list [] instead of what I expect which is a list of all files in the in the directory expanded, ie: lncRNA_fasta_renamed/{sample}/{sample}_{i}.fa.

This is odd especially given the fact that checkpoint clustering does run correctly, and the downstream output files are in the rule all

Does anyone have any idea why this would be the case? Or thing of a reason this might be the case.

Command used to run snakemake: snakemake -rs Assemble_regions.snake --configfile snake_config_files/annotated_group_config.yaml


Solution

  • Just figured this out. The issue was my aggregate command targeting the wrong file. Previously I had it written as

    def aggregate_input(wildcards):
        checkpoint_output = checkpoints.clustering.get(**wildcards).output[0]
        x = expand("lncRNA_fasta_renamed/{sample}/{sample}_{i}.fa",
        ¦   sample=sample,
        ¦   i=glob_wildcards(os.path.join(checkpoint_output, "{i}.fa")).i)
        print(x)
        return x
    

    This issue however, is it was targeting the wrong files. Instead of globbig {i}.fa, it should be globbing the files produced from the checkpoint clustering. So changing this code to

    def aggregate_input(wildcards):
        checkpoint_output = checkpoints.clustering.get(**wildcards).output[0]
        print(checkpoint_output)
        x = expand("lncRNA_fasta_renamed/{sample}/{sample}_{i}.fa",
        ¦   sample=wildcards.sample,
        ¦   i=glob_wildcards(os.path.join(checkpoint_output, "{sample}_{i}.gtf")).i)
        print(x)
        return x
    

    Solved the issue.