Snakemake: Use checkpoint and function to aggregate unknown number of files using wildcards

Before asking, I checked this, Snakemake's documentation, this, and this. They may well have answered my question, but I did not understand how.

In short, one rule creates a number of files from other files, and both sets follow a wildcard pattern. I don't know how many files get created, because I don't know how many are downloaded in the first place.

In all of the examples I've read so far, the output is directory("the/path"), while I have "the/path/{id}.txt". I guess this changes how I should call the checkpoint in the input function, and how I should use expand.

The rules in question are:

download_mv

textgrid_to_ctm_txt

get_MV_IDs

merge_ctms

The order of the rules should be:

download_mv (creates {MV_ID}.TEX and .wav files, though not necessarily the same number of each)

textgrid_to_ctm_txt (creates from {MV_ID}.TEX matching .txt and .ctm)

get_MV_IDs (should make a list of the .ctm files)

merge_ctms (should concatenate the ctm files)

kaldi_align (from the .wav and .txt directories creates one ctm file)

analyse_align (compares the ctm file from kaldi_align to the one from merge_ctms)

upload_print_results

I have also tried making the outputs of download_mv directories and then collecting the IDs from them, but that gave me different errors. With the current version, snakemake --dryrun gives:

Building DAG of jobs...
InputFunctionException in line 40 of Snakefile:
Error:
  WorkflowError:
    Missing wildcard values for MV_ID
Wildcards:

Traceback:
  File "Snakefile", line 35, in get_MV_IDs

The Snakefile is:

import os


rule all:
  input:
    "mv_data/results.txt"


rule download_mv:
  params:
    allas="allas:2004354-mv/mv/",
    textgrid="mv_data/TEXTGRID",
    wav="mv_data/wav",
  output:
    textgrid="mv_data/TEXTGRID/{MV_ID}.TEX",
    wav="mv_data/wav/{MV_ID}.wav",
  shell:'''
    rclone copy {params.allas}/mv_original/TEXTGRID/ {params.textgrid}
    rclone copy {params.allas}/mv_added/wav/ {params.wav}
    '''


checkpoint textgrid_to_ctm_txt:
  input:
    textgrid="mv_data/TEXTGRID/{MV_ID}.TEX",
    script="data_preparation/read_mv_TextGrid.py"
  output:
    ctm="mv_data/ctm/{MV_ID}.ctm",
    txt="mv_data/txt/{MV_ID}.txt",
  shell:
    "python {input.script} {input.textgrid}"


def get_MV_IDs(wildcards):
  checkpoint_output = checkpoints.textgrid_to_ctm_txt.get(**wildcards).output[0]
  TMP_VAR, = glob_wildcards(os.path.join(checkpoint_output,"{MV_ID}.TEX"))
  return expand(os.path.join(checkpoint_output,"{mv_id}.TEX"),mv_id=TMP_VAR)


rule merge_ctms:
  input:
    get_MV_IDs
  output:
    gold_ctm="mv_data/ctm/ctm",
  shell:
    "cat {input.get_MV_IDs} > {output.gold_ctm}"


rule kaldi_align:
  input:
    gold_ctm="mv_data/ctm/ctm",
    script="interfaces/kaldi-align.py",
  params:
    alignments="mv_data/align",
    wav="mv_data/wav",
  output:
    created_ctm="mv_data/align/ctm",
  shell:'''
    python {input.script} --wav {params.wav} --txt mv_data/txt --lang fi {params.alignments} < kaldi-align_prompts
    '''


rule analyse_align:
  input:
    script="analysis/calculate_metrics.py",
    gold_ctm="mv_data/ctm/ctm",
    created_ctm="mv_data/align/ctm",
  output:
    results="mv_data/results.txt"
  shell:
    "python -m analysis.calculate_metrics {input.gold_ctm} {input.created_ctm} mv > {output.results}"


rule upload_print_results:
  input:
    results="mv_data/results.txt",
  params:
    allas="allas:2004354-mv/mv/",
  shell:
    "rclone copyto {input.results} {params.allas}"

UPDATE

I got this to work by relying on Bash features (globbing and a for loop) instead of Snakemake wildcards. I'd still appreciate it if someone could show me how this should have been done in Snakemake:

import os


rule all:
  input:
    "mv_data/results.txt"


rule download_mv:
  params:
    allas="allas:2004354-mv/mv",
  output:
    textgrid=directory("mv_data/TEXTGRID"),
    wav=directory("mv_data/wav"),
  shell:'''
    rclone copy {params.allas}/mv_original/TEXTGRID/ {output.textgrid}
    rclone copy {params.allas}/mv_added/wav/ {output.wav}
    '''


checkpoint textgrid_to_ctm_txt:
  input:
    textgrid="mv_data/TEXTGRID",
    script="data_preparation/read_mv_TextGrid.py"
  output:
    ctm=directory("mv_data/ctm"),
    txt=directory("mv_data/txt"),
  shell:'''
    for textgrid in {input.textgrid}/*.TEX;
    do
      python {input.script} "$textgrid";
    done
    '''

rule merge_ctms:
  input:
    ctm="mv_data/ctm"
  output:
    gold_ctm="mv_data/gold_ctm",
  shell:
    "cat {input.ctm}/*.ctm > {output.gold_ctm}"


rule kaldi_align:
  input:
    wav="mv_data/wav",
    txt="mv_data/txt",
    script="interfaces/kaldi_align.py",
  params:
    alignments="mv_data/align",
  output:
    created_ctm="mv_data/align/ctm",
  shell:'''
    python {input.script} --wav {input.wav} --txt mv_data/txt --lang fi {params.alignments} < kaldi-align_prompts
    '''

rule analyse_align:
  input:
    script="analysis/calculate_metrics.py",
    gold_ctm="mv_data/gold_ctm",
    created_ctm="mv_data/align/ctm",
  output:
    results="mv_data/results.txt"
  shell:
    "python -m analysis.calculate_metrics {input.gold_ctm} {input.created_ctm} mv > {output.results}"


rule upload_print_results:
  input:
    results="mv_data/results.txt",
  params:
    allas="allas:2004354-mv/mv/",
  shell:
    "rclone copyto {input.results} {params.allas}"

Solution

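The reason for the error is the following:

You use an input function in rule merge_ctms to access the files generated by the checkpoint. But merge_ctms has no wildcard in its output file name, so the wildcards object passed to get_MV_IDs is empty (note the empty "Wildcards:" section in the error message). When you then call checkpoints.textgrid_to_ctm_txt.get(**wildcards), Snakemake does not know which value to fill in for MV_ID in your checkpoint.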

I'm also a bit confused about the way you use the checkpoint. Since (I guess) you don't know in advance how many .TEX files will be downloaded, shouldn't the checkpoint's output be the directory that stores the .TEX files, so that you can then use glob_wildcards to find out which .TEX files were downloaded?
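To illustrate that discovery step, here is a rough plain-Python stand-in for what glob_wildcards plus expand would give you. It uses only the standard library so it runs without Snakemake. The helper names discover_ids and ctm_targets and the demo file names are made up for illustration, and unlike glob_wildcards this version does not look into subdirectories.

```python
import os
import tempfile


def discover_ids(directory, suffix=".TEX"):
    """Return the sorted IDs of all files in `directory` that end in `suffix`."""
    return sorted(
        name[: -len(suffix)]
        for name in os.listdir(directory)
        if name.endswith(suffix)
    )


def ctm_targets(ids, ctm_dir="mv_data/ctm"):
    """Map each discovered ID to the .ctm path the conversion rule should produce."""
    return [f"{ctm_dir}/{mv_id}.ctm" for mv_id in ids]


# Demo: a throwaway directory stands in for mv_data/TEXTGRID after the download.
with tempfile.TemporaryDirectory() as textgrid_dir:
    for name in ("a01.TEX", "b02.TEX", "notes.log"):
        open(os.path.join(textgrid_dir, name), "w").close()
    demo_ids = discover_ids(textgrid_dir)
    demo_targets = ctm_targets(demo_ids)

print(demo_ids)      # ['a01', 'b02']
print(demo_targets)  # ['mv_data/ctm/a01.ctm', 'mv_data/ctm/b02.ctm']
```

The point is that the list of IDs can only be computed once the directory has been filled, which is exactly what a checkpoint lets Snakemake wait for.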

An alternative I can think of is to make download_mv the checkpoint, with the directory containing the .TEX files as its output. Then, in the input function, map each discovered .TEX file to the corresponding .ctm path, so that the format conversion runs once per file as an ordinary wildcard rule.
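A sketch of that alternative, following the checkpoint pattern from the Snakemake documentation, could look roughly like this. I have not run it against your data. It assumes that read_mv_TextGrid.py writes mv_data/ctm/{MV_ID}.ctm and mv_data/txt/{MV_ID}.txt for a given .TEX file (as your first Snakefile implies), the function name get_ctms is my own, and the merged file is written to mv_data/gold_ctm as in your update so that it does not sit next to the per-ID .ctm files.

```python
import os


# Checkpoint: which files exist is only known after the download has run,
# so the outputs are directories rather than per-ID files.
checkpoint download_mv:
  params:
    allas="allas:2004354-mv/mv",
  output:
    textgrid=directory("mv_data/TEXTGRID"),
    wav=directory("mv_data/wav"),
  shell:'''
    rclone copy {params.allas}/mv_original/TEXTGRID/ {output.textgrid}
    rclone copy {params.allas}/mv_added/wav/ {output.wav}
    '''


# Ordinary rule: one job per MV_ID, requested by get_ctms below.
rule textgrid_to_ctm_txt:
  input:
    textgrid="mv_data/TEXTGRID/{MV_ID}.TEX",
    script="data_preparation/read_mv_TextGrid.py",
  output:
    ctm="mv_data/ctm/{MV_ID}.ctm",
    txt="mv_data/txt/{MV_ID}.txt",
  shell:
    "python {input.script} {input.textgrid}"


def get_ctms(wildcards):
  # download_mv has no wildcards, so .get() is called without arguments.
  # Until the checkpoint has finished, this call makes Snakemake postpone
  # evaluating the function; afterwards it returns the finished job.
  textgrid_dir = checkpoints.download_mv.get().output.textgrid
  # Collect the IDs of the .TEX files that were actually downloaded.
  ids = glob_wildcards(os.path.join(textgrid_dir, "{MV_ID}.TEX")).MV_ID
  # Request one .ctm per downloaded .TEX file.
  return expand("mv_data/ctm/{MV_ID}.ctm", MV_ID=ids)


rule merge_ctms:
  input:
    ctms=get_ctms,
  output:
    gold_ctm="mv_data/gold_ctm",
  shell:
    "cat {input.ctms} > {output.gold_ctm}"
```

Here the wildcard MV_ID is only ever filled in by expand inside the input function, so merge_ctms itself does not need any wildcard. Since mv_data/txt is no longer a directory output in this layout, kaldi_align would probably need a similar input function that expands to the mv_data/txt/{MV_ID}.txt files.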