
Snakemake wildcards: Using wildcarded files from directory output


I'm new to Snakemake and am trying to use specific files in a rule, taken from the directory() output of another rule that clones a git repo.

Currently, this gives me the error Wildcards in input files cannot be determined from output files: 'json_file', and I don't understand why. I have previously worked through the tutorial at https://carpentries-incubator.github.io/workflows-snakemake/index.html.

The difference between my workflow and the tutorial's is that I want to create the data I use later in the first step, whereas in the tutorial the data was already there.

Workflow description in plain text:

  1. Clone a git repository to path {path}
  2. Run a script {script} on every single JSON file in the directory {path}/parsed/ in parallel to produce the aggregate result {result}

My Snakefile so far:

GIT_PATH = config['git_local_path']  # git/
PARSED_JSON_PATH = f'{GIT_PATH}parsed/'
GIT_URL = config['git_url']

# A single parsed JSON file
PARSED_JSON_FILE = f'{PARSED_JSON_PATH}{{json_file}}.json'

# Build a list of parsed JSON file names
PARSED_JSON_FILE_NAMES = glob_wildcards(PARSED_JSON_FILE).json_file

# All parsed JSON files
ALL_PARSED_JSONS = expand(PARSED_JSON_FILE, json_file=PARSED_JSON_FILE_NAMES)


rule all:
    input: 'result.json'

rule clone_git:
    output: directory(GIT_PATH)
    threads: 1
    conda: f'{ENVS_DIR}git.yml'
    shell: f'git clone --depth 1 {GIT_URL} {{output}}'

rule extract_json:
    input:
        cmd='scripts/extract_json.py',
        json_file=PARSED_JSON_FILE
    output: 'result.json'
    threads: 50
    shell: 'python {input.cmd} {input.json_file} {output}'

Running only clone_git works fine (if I set GIT_PATH as the input of rule all).

Why do I get the error message? Is this because the JSON files don't exist when the workflow is started?

Also - I don't know if this matters - this is a subworkflow, used with the module directive.


Solution

  • What you need seems to be a checkpoint rule: the checkpoint is executed first, and only then does Snakemake determine which .json files are present and run your extract/aggregate rules. Here's an adapted example:

    I'm struggling to fully understand the file and folder structure you get after cloning your git repo, so I have fallen back to Snakemake's best practice of using resources/ for downloaded files and results/ for created files.

    You'll need to re-adjust those paths to match your case:

    GIT_PATH = config["git_local_path"]  # git/
    GIT_URL = config["git_url"]
    ENVS_DIR = config["envs_dir"]  # or however ENVS_DIR is defined in your original Snakefile
    
    checkpoint clone_git:
        output:
            git=directory(GIT_PATH),
        threads: 1
        conda:
            f"{ENVS_DIR}git.yml"
        shell:
            f"git clone --depth 1 {GIT_URL} {{output.git}}"
    
    
    rule extract_json:
        input:
            cmd="scripts/extract_json.py",
            json_file="resources/{file_name}.json",
        output:
            "results/parsed_files/{file_name}.json",
        shell:
            "python {input.cmd} {input.json_file} {output}"
    
    
    def get_all_json_file_names(wildcards):
        # Calling get() ensures the checkpoint has run before we glob
        git_dir = checkpoints.clone_git.get(**wildcards).output["git"]
        # The glob must point at the directory the checkpoint actually
        # populated; with your layout this would be something like
        # f"{git_dir}/parsed/{{file_name}}.json"
        file_names = glob_wildcards("resources/{file_name}.json").file_name
        return expand(
            "results/parsed_files/{file_name}.json",
            file_name=file_names,
        )
    
    # This rule has a checkpoint dependency: only after the checkpoint has
    # been executed is the input function evaluated, which then determines
    # all JSON files downloaded from the git repo
    rule aggregate:
        input:
            get_all_json_file_names
        output:
            "result.json",
        default_target: True
        # TODO: replace with your actual aggregation command; the helper
        # script sketched below is hypothetical
        shell:
            "python scripts/aggregate.py {input} {output}"
    
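    For completeness, here is what that aggregation step could look like. This is a minimal sketch of a hypothetical helper (scripts/aggregate.py is not part of your repo) that simply merges all input JSON files into one list:

    # scripts/aggregate.py -- hypothetical helper used by rule aggregate
    import json
    import sys

    # All JSON inputs come first on the command line; the last argument is
    # the output path (matching the shell command in rule aggregate above)
    *input_paths, output_path = sys.argv[1:]

    merged = []
    for path in input_paths:
        with open(path) as f:
            merged.append(json.load(f))

    with open(output_path, "w") as f:
        json.dump(merged, f, indent=2)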

    edit: Moved the expand(...) from rule aggregate into get_all_json_file_names.
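    Regarding the module question: a checkpoint is declared like any other rule, so the subworkflow can still be consumed via the module directive. A minimal sketch, assuming the Snakefile above lives at json_workflow/Snakefile (a hypothetical path):

    # In the parent Snakefile
    module json_workflow:
        snakefile:
            "json_workflow/Snakefile"
        config:
            config

    # Import all rules (including the checkpoint) with a prefix
    use rule * from json_workflow as json_*

    One caveat: checkpoint resolution inside modules has had rough edges in some Snakemake releases, so test this against the version you are running.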