Before asking, I checked this, snakemake's documentation, this, and this. Maybe they actually answered this question, but I just didn't understand it.
In short, one rule creates a number of files from other files, and both sets conform to a wildcard pattern. I don't know how many files I create, since I don't know how many I originally download.
In all of the examples I've read so far, the output is directory("the/path"), while I have "the/path/{id}.txt". I guess this changes how I call the checkpoint in the input function itself, and how I use expand.
The rules in question are:
download_mv
textgrid_to_ctm_txt
get_MV_IDs
merge_ctms
The order of the rules should be:
download_mv (creates {MV_ID}.TEX and .wav, though not necessarily the same number of each)
textgrid_to_ctm_txt (creates from {MV_ID}.TEX matching .txt and .ctm)
get_MV_IDs (should make a list of the .ctm files)
merge_ctms (should concatenate the ctm files)
kaldi_align (from the .wav and .txt directories creates one ctm file)
analyse_align (compares the ctm file from kaldi_align to the one from merge_ctms)
upload_print_results
I have tried making the outputs of download_mv directories and then getting the IDs, but that gave different errors. Now, with snakemake --dryrun, I get:
Building DAG of jobs...
InputFunctionException in line 40 of Snakefile:
Error:
WorkflowError:
Missing wildcard values for MV_ID
Wildcards:
Traceback:
File "Snakefile", line 35, in get_MV_IDs
The Snakefile is:
import os

rule all:
    input:
        "mv_data/results.txt"

rule download_mv:
    params:
        allas="allas:2004354-mv/mv/",
        textgrid="mv_data/TEXTGRID",
        wav="mv_data/wav",
    output:
        textgrid="mv_data/TEXTGRID/{MV_ID}.TEX",
        wav="mv_data/wav/{MV_ID}.wav",
    shell:'''
        rclone copy {params.allas}/mv_original/TEXTGRID/ {params.textgrid}
        rclone copy {params.allas}/mv_added/wav/ {params.wav}
        '''

checkpoint textgrid_to_ctm_txt:
    input:
        textgrid="mv_data/TEXTGRID/{MV_ID}.TEX",
        script="data_preparation/read_mv_TextGrid.py"
    output:
        ctm="mv_data/ctm/{MV_ID}.ctm",
        txt="mv_data/txt/{MV_ID}.txt",
    shell:
        "python {input.script} {input.textgrid}"

def get_MV_IDs(wildcards):
    checkpoint_output = checkpoints.textgrid_to_ctm_txt.get(**wildcards).output[0]
    TMP_VAR, = glob_wildcards(os.path.join(checkpoint_output,"{MV_ID}.TEX"))
    return expand(os.path.join(checkpoint_output,"{mv_id}.TEX"),mv_id=TMP_VAR)

rule merge_ctms:
    input:
        get_MV_IDs
    output:
        gold_ctm="mv_data/ctm/ctm",
    shell:
        "cat {input.get_MV_IDs} > {output.gold_ctm}"

rule kaldi_align:
    input:
        gold_ctm="mv_data/ctm/ctm",
        script="interfaces/kaldi-align.py",
    params:
        alignments="mv_data/align",
        wav="mv_data/wav",
    output:
        created_ctm="mv_data/align/ctm",
    shell:'''
        python {input.script} --wav {params.wav} --txt mv_data/txt --lang fi {params.alignments} < kaldi-align_prompts
        '''

rule analyse_align:
    input:
        script="analysis/calculate_metrics.py",
        gold_ctm="mv_data/ctm/ctm",
        created_ctm="mv_data/align/ctm",
    output:
        results="mv_data/results.txt"
    shell:
        "python -m analysis.calculate_metrics {input.gold_ctm} {input.created_ctm} mv > {output.results}"

rule upload_print_results:
    input:
        results="mv_data/results.txt",
    params:
        allas="allas:2004354-mv/mv/",
    shell:
        "rclone copyto {input.results} {params.allas}"
UPDATE
So I made this work by using bash functionalities instead of snakemake. I'd still appreciate it if someone could instruct me on how this should have been done:
import os

rule all:
    input:
        "mv_data/results.txt"

rule download_mv:
    params:
        allas="allas:2004354-mv/mv",
    output:
        textgrid=directory("mv_data/TEXTGRID"),
        wav=directory("mv_data/wav"),
    shell:'''
        rclone copy {params.allas}/mv_original/TEXTGRID/ {output.textgrid}
        rclone copy {params.allas}/mv_added/wav/ {output.wav}
        '''

checkpoint textgrid_to_ctm_txt:
    input:
        textgrid="mv_data/TEXTGRID",
        script="data_preparation/read_mv_TextGrid.py"
    output:
        ctm=directory("mv_data/ctm"),
        txt=directory("mv_data/txt"),
    shell:'''
        for textgrid in {input.textgrid}/*.TEX;
        do
            python {input.script} "$textgrid";
        done
        '''

rule merge_ctms:
    input:
        ctm="mv_data/ctm"
    output:
        gold_ctm="mv_data/gold_ctm",
    shell:
        "cat {input.ctm}/*.ctm > {output.gold_ctm}"

rule kaldi_align:
    input:
        wav="mv_data/wav",
        txt="mv_data/txt",
        script="interfaces/kaldi_align.py",
    params:
        alignments="mv_data/align",
    output:
        created_ctm="mv_data/align/ctm",
    shell:'''
        python {input.script} --wav {input.wav} --txt mv_data/txt --lang fi {params.alignments} < kaldi-align_prompts
        '''

rule analyse_align:
    input:
        script="analysis/calculate_metrics.py",
        gold_ctm="mv_data/gold_ctm",
        created_ctm="mv_data/align/ctm",
    output:
        results="mv_data/results.txt"
    shell:
        "python -m analysis.calculate_metrics {input.gold_ctm} {input.created_ctm} mv > {output.results}"

rule upload_print_results:
    input:
        results="mv_data/results.txt",
    params:
        allas="allas:2004354-mv/mv/",
    shell:
        "rclone copyto {input.results} {params.allas}"
The reason you got the error is this: you use an input function in rule merge_ctms to access the files generated by the checkpoint, but merge_ctms has no wildcard in its output file name, so snakemake doesn't know which value to fill in for MV_ID in your checkpoint.
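As a rough analogy for that error, here is a plain-Python sketch (not Snakemake's actual implementation) of what happens when a pattern containing {MV_ID} is filled from an empty set of wildcards. The helper name fill_pattern and the ID "0001" are made up for illustration.

```python
import string


def fill_pattern(pattern: str, wildcards: dict) -> str:
    """Fill a path pattern; fail if any wildcard has no value (hypothetical helper)."""
    # string.Formatter().parse yields (literal, field_name, spec, conversion) tuples.
    needed = [name for _, name, _, _ in string.Formatter().parse(pattern) if name]
    missing = [name for name in needed if name not in wildcards]
    if missing:
        raise ValueError("Missing wildcard values for " + ", ".join(missing))
    return pattern.format(**wildcards)


CTM_PATTERN = "mv_data/ctm/{MV_ID}.ctm"

# merge_ctms has no wildcard in its output, so its input function gets no values.
try:
    fill_pattern(CTM_PATTERN, {})
except ValueError as err:
    message = str(err)

# With a value for MV_ID (here an invented one), the pattern can be filled.
filled = fill_pattern(CTM_PATTERN, {"MV_ID": "0001"})
```

Calling checkpoints.textgrid_to_ctm_txt.get(**wildcards) from merge_ctms is in the same position as the first call: there is no MV_ID to pass along.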
I'm also a bit confused about the way you use the checkpoint. Since you are not sure how many .TEX files will be downloaded (I guess), shouldn't you use the directory that stores the .TEX files as the output of the checkpoint, then use glob_wildcards to find out how many .TEX files you downloaded?
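To make the idea concrete, this is a minimal standard-library sketch of the lookup that glob_wildcards would do on the checkpoint's output directory: list the directory and keep the part of each file name that would match {MV_ID}. The function name collect_ids is hypothetical, and this is a stand-in for illustration, not Snakemake code.

```python
from pathlib import Path
from typing import List


def collect_ids(directory: str, suffix: str = ".TEX") -> List[str]:
    """Return the sorted IDs of all files in `directory` whose name ends in `suffix`."""
    return sorted(
        p.name[: -len(suffix)]          # strip the suffix to get the ID
        for p in Path(directory).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )
```

The point is that the IDs are only discovered after the download has run, which is why the directory, and not a {MV_ID} pattern, has to be the checkpoint output.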
An alternative solution I can think of is to make download_mv your checkpoint and set its output to the directory containing the .TEX files. Then, in the input function, replace the .TEX extension with .ctm so that the .ctm files are requested and the format conversion runs once per file.
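The extension swap described here could look roughly like the following plain-Python sketch. It assumes the .ctm files should live in mv_data/ctm, as in the question; the function name ctm_targets is hypothetical.

```python
from pathlib import Path
from typing import List


def ctm_targets(textgrid_dir: str, ctm_dir: str = "mv_data/ctm") -> List[str]:
    """List the .ctm path expected for every .TEX file found in `textgrid_dir`."""
    ids = sorted(p.stem for p in Path(textgrid_dir).glob("*.TEX"))
    # One target per downloaded TextGrid; as_posix keeps forward slashes.
    return [(Path(ctm_dir) / f"{mv_id}.ctm").as_posix() for mv_id in ids]
```

In the Snakefile, the input function of merge_ctms would presumably first call checkpoints.download_mv.get(**wildcards).output[0] to obtain the directory and then return a list like this one, while textgrid_to_ctm_txt stays an ordinary rule with the {MV_ID} wildcard that produces each .ctm file.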