I'm new to Snakemake and try to use specific files in a rule, from the directory()
output of another rule that clones a git repo.
Currently, this gives me an error Wildcards in input files cannot be determined from output files: 'json_file'
, and I don't understand why. I have previously worked through the tutorial at https://carpentries-incubator.github.io/workflows-snakemake/index.html.
The difference between my workflow and the tutorial workflow is that I want to create the data I use later in the first step, whereas in the tutorial, the data was already there.
Workflow description in plain text:
GIT_PATH = config['git_local_path'] # git/
PARSED_JSON_PATH = f'{GIT_PATH}parsed/'
GIT_URL = config['git_url']
# A single parsed JSON file
PARSED_JSON_FILE = f'{PARSED_JSON_PATH}{{json_file}}.json'
# Build a list of parsed JSON file names
PARSED_JSON_FILE_NAMES = glob_wildcards(PARSED_JSON_FILE).json_file
# All parsed JSON files
ALL_PARSED_JSONS = expand(PARSED_JSON_FILE, json_file=PARSED_JSON_FILE_NAMES)
rule all:
input: 'result.json'
rule clone_git:
output: directory(GIT_PATH)
threads: 1
conda: f'{ENVS_DIR}git.yml'
shell: f'git clone --depth 1 {GIT_URL} {{output}}'
rule extract_json:
input:
cmd='scripts/extract_json.py',
json_file=PARSED_JSON_FILE
output: 'result.json'
threads: 50
shell: 'python {input.cmd} {input.json_file} {output}'
Running only clone_git
works fine (if I set an all
input
of GIT_PATH
).
Why do I get the error message? Is this because the JSON files don't exist when the workflow is started?
Also - I don't know if this matters - this is a subworkflow used with module
.
What you need seems to be a checkpoint
rule which is first executed and only then snakemake
determines which .json
files are present and runs your extract/aggregate functions. Here's an example adapted:
I'm struggling to fully understand the file and folder structure you get after cloning your git repo. So I have fallen back to the best practices by Snakemake of using resources
for downloaded and results
for created files.
You'll need to re-adjust those paths to match your case again:
GIT_PATH = config["git_local_path"] # git/
GIT_URL = config["git_url"]
checkpoint clone_git:
output:
git=directory(GIT_PATH),
threads: 1
conda:
f"{ENVS_DIR}git.yml"
shell:
f"git clone --depth 1 {GIT_URL} {{output.git}}"
rule extract_json:
input:
cmd="scripts/extract_json.py",
json_file="resources/{file_name}.json",
output:
"results/parsed_files/{file_name}.json",
shell:
"python {input.cmd} {input.json_file} {output}"
def get_all_json_file_names(wildcards):
git_dir = checkpoints.clone_git.get(**wildcards).output["git"]
file_names = glob_wildcards(
"resources/{file_name}.json"
).file_name
return expand(
"results/parsed_files/{file_name}.json",
file_name=file_names,
)
# Rule has checkpoint dependency: Only after the checkpoint is executed
# the rule is executed which then evaluates the function to determine all
# json files downloaded from the git repo
rule aggregate:
input:
get_all_json_file_names
output:
"result.json",
default_target: True
shell:
# TODO: Action which combines all JSON files
edit: Moved the expand(...)
from rule aggregate
into get_all_json_file_names
.