
How do Snakemake checkpoints work when I do not want to create a folder per wildcard?


I have a Snakefile where one rule produces a file from which I would like to extract the header and use the column names as wildcards in my rule all. The Snakemake guide provides an example where it creates new folders named after the wildcards, but I would like to avoid that if possible, since in some cases it would need to create 100-200 folders. Any suggestions on how to make it work?

Link to the Snakemake guide: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html

import pandas as pd

rule all:
    input: 
        final_report = expand('report_{fruit}.txt', fruit= ???)

rule create_file:
    input:
    output:
        fruit = 'fruit_file.csv'
    run:
        ....

rule next:
    input:
        fruit = 'fruit_file.csv'
    output:
        report = 'report_{fruit}.txt'
    run:
        # Pass the path itself; {input.fruit} in a run block would be a Python set literal.
        fruit_file = pd.read_csv(input.fruit, header=0, sep='\t')
        fruits = fruit_file.columns.tolist()[2:]
        for i in fruits:
            cmd = 'touch report_' + i + '.txt'
            shell(cmd)

This is a simplified workflow, since I am actually using a long script to produce both fruit_file.csv and the report files.

fruit_file.csv is tab-separated and could look like this:

FID     IID      Apple   Banana  Plum
Mouse   Mickey   0       0       1
Mouse   Minnie   1       0       1
Duck    Donnald  0       1       0

Solution

  • I think you are misreading the Snakemake checkpoint example. You only need to create one folder in your case. The example has a wildcard (sample) in the folder name, but that part of the output name is known ahead of time.

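    import os
    import pandas as pd

    checkpoint fruit_reports:
        input:
            fruit = 'fruit_file.csv'
        output:
            report_dir = directory('reports')
        run:
            # Create the output directory before writing into it.
            os.makedirs(output.report_dir, exist_ok=True)
            # Pass the path itself; {input.fruit} here would be a Python set literal.
            fruit_file = pd.read_csv(input.fruit, header=0, sep='\t')
            # Columns after FID and IID are the fruit names.
            fruits = fruit_file.columns.tolist()[2:]
            for i in fruits:
                shell(f'touch {output.report_dir}/report_{i}.txt')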
    

    Since you do not know all names (fruits) ahead of time, you cannot include them in the all rule. You need to reference an intermediate rule to bring everything together. Maybe use a final report file:

    rule all:
       input: 'report.txt'
    

    Then after the checkpoint:

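    import os

    def aggregate_fruit(wildcards):
        # Resolved only after the checkpoint has run; Snakemake then re-evaluates this function.
        checkpoint_output = checkpoints.fruit_reports.get(**wildcards).output[0]
        # Collect the fruit names from the report files that were actually created.
        fruits = glob_wildcards(os.path.join(checkpoint_output, "report_{i}.txt")).i
        return expand("reports/report_{i}.txt", i=fruits)


    rule report:
        input:
            aggregate_fruit
        output:
            "report.txt"
        shell:
            "ls -1 {input} > {output}"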
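
If you want to check what the checkpoint and the input function should produce without launching Snakemake, here is a minimal plain-Python sketch of the same logic. It is only an illustration under a few assumptions: the function names (fruit_names, write_reports, discovered_fruits) are hypothetical helpers and not part of Snakemake, the table is tab-separated with FID and IID as the first two columns, and the regular expression only approximates what glob_wildcards does with "report_{i}.txt".

```python
import csv
import re
from pathlib import Path


def fruit_names(table_path):
    """Return the header columns after the two ID columns (FID, IID)."""
    with open(table_path, newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"))
    return header[2:]


def write_reports(table_path, report_dir):
    """Create one empty report per fruit inside a single directory."""
    out = Path(report_dir)
    out.mkdir(parents=True, exist_ok=True)
    for fruit in fruit_names(table_path):
        (out / f"report_{fruit}.txt").touch()


def discovered_fruits(report_dir):
    """Recover the fruit names from the report files found on disk.

    Approximates glob_wildcards(report_dir + "/report_{i}.txt").i
    """
    pattern = re.compile(r"report_(.+)\.txt")
    names = []
    for path in sorted(Path(report_dir).iterdir()):
        match = pattern.fullmatch(path.name)
        if match:
            names.append(match.group(1))
    return names
```

Only the header row is read, so this stays cheap even for a large table. Calling write_reports("fruit_file.csv", "reports") and then discovered_fruits("reports") should give back the same names as fruit_names("fruit_file.csv"), which is the list the report rule ends up receiving as input.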