Search code examples
pythondirectivesnakemake

Why doesn't this snakemake script directive work


I'm creating a snakemake workflow. I've never used the run or script directives until yesterday. I had a run directive that worked just fine, but snakemake --lint complained it was too long and said I should create a separate script and use the script directive:

Lints for rule check_peak_read_len_overlap_params (line 90, /Users/rleach/PROJECT-local/ATACC/REPOS/ATACCompendium/workflow/rules/error_checks.smk):
    * Migrate long run directives into scripts or notebooks:
      Long run directives hamper workflow readability. Use the script or notebook directive instead. Note that the script or notebook directive does not involve boilerplate. Similar to run, you will have direct access to params, input, output, and wildcards.Only use the run directive for a
      handful of lines.
      Also see:
      https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#external-scripts
      https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#jupyter-notebook-integration

So I tried turning this:

rule check_peak_read_len_overlap_params:
    input:
        "results/QC/max_read_length.txt",
    output:
        "results/QC/parameter_validation.txt",
    params:
        frac_read_overlap=FRAC_READ_OVERLAP,
        min_peak_width=MAX_ARTIFACT_WIDTH + 1,
        summit_width=SUMMIT_FLANK_SIZE * 2,
    log:
        tail="results/QC/logs/check_peak_read_len_overlap_params.stderr",
    run:
        max_read_len = 0
        # Get the max read length from the input file
        infile = open(input[0])
        for line in infile:
            max_read_len = int(line.strip())
            # The file should only be one line
            break
        infile.close()

        max_peak_frac = params.min_peak_width / max_read_len
        max_summit_frac = params.summit_width / max_read_len

        if max_peak_frac < params.frac_read_overlap:
            raise ValueError(
                f"There exist reads (of length {max_read_len}) that, if "
                "mapped to the smallest allowed peak (of length "
                f"{params.min_peak_width}, based on a MAX_ARTIFACT_WIDTH "
                f"of {MAX_ARTIFACT_WIDTH}), would never be counted using a "
                f"FRAC_READ_OVERLAP of {params.frac_read_overlap}."
            )

        if max_summit_frac < params.frac_read_overlap:
            raise ValueError(
                f"There exist reads (of length {max_read_len}) that, if "
                f"mapped to any summit (of length {params.summit_width}), "
                "would never be counted using a FRAC_READ_OVERLAP of "
                f"{params.frac_read_overlap}."
            )

        with open(output[0], "w") as out:
            out.write("Parameters have validated successfully.")

into this:

rule check_peak_read_len_overlap_params:
    input:
        "results/QC/max_read_length.txt",
    output:
        "results/QC/parameter_validation.txt",
    params:
        frac_read_overlap=FRAC_READ_OVERLAP,
        max_artifact_width=MAX_ARTIFACT_WIDTH,
        summit_flank_size=SUMMIT_FLANK_SIZE,
    log:
        "results/QC/logs/parameter_validation.log",
    conda:
        "../envs/python3.yml"
    script:
        "scripts/check_peak_read_len_overlap_params.py"

(Note, I changed the parameters at the same time so I could manipulate them in the script instead of under the params directive.) At first, I simply pasted the run code into a function, and called that function. Then I tried tweaking the code to get some sort of log output. I tried adding a shebang at the top. I tried inporting snakemake (which the docs didn't mention). I tried a bunch of things, but nothing seemed to work. Currently, this is what I have in the script:

import sys


def check_peak_read_len_overlap_params(
    infile,
    outfile,
    log,
    frac_read_overlap,
    max_artifact_width,
    summit_flank_size,
):
    log_handle = open(log, "w")
    sys.stderr = sys.stdout = log_handle

    print(
        "Parameters:\n"
        f"Input: {infile}\n"
        f"Output: {outfile}\n"
        f"Log: {log}\n"
        f"\tFRAC_READ_OVERLAP = {frac_read_overlap}\n"
        f"\tMAX_ARTIFACT_WIDTH = {max_artifact_width}\n"
        f"\tSUMMIT_FLANK_SIZE = {summit_flank_size}"
    )

    min_peak_width = max_artifact_width + 1
    summit_width = summit_flank_size * 2
    max_read_len = 0
    inhandle = open(infile)
    for line in inhandle:
        max_read_len = int(line.strip())
        # The file should only be one line
        break
    inhandle.close()

    max_peak_frac = min_peak_width / max_read_len
    max_summit_frac = summit_width / max_read_len

    if max_peak_frac < frac_read_overlap:
        raise ValueError(
            f"There exist reads (of length {max_read_len}) that, if "
            "mapped to the smallest allowed peak (of length "
            f"{min_peak_width}, based on a MAX_ARTIFACT_WIDTH "
            f"of {max_artifact_width}), would never be counted using a "
            f"FRAC_READ_OVERLAP of {frac_read_overlap}."
        )

    if max_summit_frac < frac_read_overlap:
        raise ValueError(
            f"There exist reads (of length {max_read_len}) that, if "
            f"mapped to any summit (of length {summit_width}), "
            "would never be counted using a FRAC_READ_OVERLAP of "
            f"{frac_read_overlap}."
        )

    with open(outfile, "w") as outhandle:
        outhandle.write("Parameters have validated successfully.")

check_peak_read_len_overlap_params(
    snakemake.input[0],
    snakemake.output[0],
    snakemake.log[0],
    snakemake.params.frac_read_overlap,
    snakemake.params.max_artifact_width,
    snakemake.params.summit_flank_size,
)

As I was working on this question, I figured out the issue, but I'll continue to post this and then answer, because nothing in the docs or via google searching, that I could find, answers this question...


Solution

  • When I tried adding a script directive, I literally copied this from the docs^:

        script:
            "scripts/script.py"
    

    ...then edited it.

    The problem was that the docs example is bad. If you follow their suggested directory structure:

    ├── .gitignore
    ├── README.md
    ├── LICENSE.md
    ├── workflow
    │   ├── rules
    |   │   ├── module1.smk
    |   │   └── module2.smk
    │   ├── envs
    |   │   ├── tool1.yaml
    |   │   └── tool2.yaml
    │   ├── scripts
    |   │   ├── script1.py
    |   │   └── script2.R
    │   ├── notebooks
    |   │   ├── notebook1.py.ipynb
    |   │   └── notebook2.r.ipynb
    │   ├── report
    |   │   ├── plot1.rst
    |   │   └── plot2.rst
    |   └── Snakefile
    ├── config
    │   ├── config.yaml
    │   └── some-sheet.tsv
    ├── results
    └── resources
    

    The path to the script has to be:

        script:
            "../scripts/script.py"
    

    Once I added that, it started working (or rather, I got to the next error).

    ^ snakemake Version 7.25.0 is the version I'm running, but it's the same as the latest.