
Snakemake rule is not picked up and cannot specify the output files


I have a folder where the outputs of the rule are generated, and I am having real trouble running Snakemake with it. If I do not specify the outputs in rule all, the rule (called neo4j) is not run at all. If I try to run it manually with snakemake neo4j (which I would prefer not to do), I get an error:

WorkflowError: Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.

I tried specifying the outputs of the rule in different ways but none of them worked.

  1. Using expand:

    expand('results/neo4j/{sample}/cells.csv', sample=samples),
    expand('results/neo4j/{sample}/genes.csv', sample=samples),
    expand('results/neo4j/{sample}/cl_nodes.csv', sample=samples),
    expand('results/neo4j/{sample}/cl_contains.csv', sample=samples),
    expand('results/neo4j/{sample}/cl_isin.csv', sample=samples),
    expand('results/neo4j/{sample}/expr_by.csv', sample=samples),
    expand('results/neo4j/{sample}/expr_ess.csv', sample=samples)
    

This generates a very odd error for a completely unrelated rule (called umap):

Missing input files for rule umap: data_files/normalized/minus_2/cl_nodes.csv.csv

The path generation is completely messed up, even though the folders are not connected in any way except that results is the root folder of all of the outputs.

  2. Using dynamic:

    dynamic('results/neo4j/{sample}/cells.csv', sample=samples),
    dynamic('results/neo4j/{sample}/genes.csv', sample=samples),
    dynamic('results/neo4j/{sample}/cl_nodes.csv', sample=samples),
    dynamic('results/neo4j/{sample}/cl_contains.csv', sample=samples),
    dynamic('results/neo4j/{sample}/cl_isin.csv', sample=samples),
    dynamic('results/neo4j/{sample}/expr_by.csv', sample=samples),
    dynamic('results/neo4j/{sample}/expr_ess.csv', sample=samples)
    

Gives an error:

dynamic() got an unexpected keyword argument 'sample'

OK, I tried removing sample=samples, but no luck.

  3. Just directory:

    directory('results/neo4j/{sample}/', sample=samples)
    

Gives error:

directory() got an unexpected keyword argument 'sample'

If I omit sample=samples, it does not work either. If I specify directory under the output of rule all, it does not work.

The rule I am having difficulty with is below:

rule neo4j:
    input:
        script = 'python/neo4j.py',
        path_to_cl = 'results/clusters/umap/{sample}_umap_clusters.csv',
        path_to_umap = 'results/umap/{sample}_umap.csv',
        path_to_mtx = 'data_files/normalized/{sample}.csv'
    output:
        base_neo4j = 'results/neo4j/{sample}'
    shell:
        "python {input.script} -path_to_cl {input.path_to_cl} -path_to_umap {input.path_to_umap} -path_to_mtx {input.path_to_mtx} -base_neo4j {output.base_neo4j}"

snakemake version is 5.2.2

Any suggestions would be greatly appreciated.

Update

I modified the Snakefile following the suggestions of Mali Akmanalp, and now rule all looks like this:

samples, = glob_wildcards('data_files/normalized/{sample}.csv')

rule all:
    input:
        expand('results/pca/img/{sample}_pca.png', sample=samples),
        expand('results/pca/{sample}_pca.csv', sample=samples),
        expand('results/tsne/{sample}_tsne.csv', sample=samples),
        expand('results/umap/{sample}_umap.csv', sample=samples),
        expand('results/umap/img/{sample}_umap.png', sample=samples),
        expand('results/tsne/img/{sample}_tsne.png', sample=samples),
        expand('results/clusters/umap/{sample}_umap_clusters.csv', sample=samples),
        expand('results/clusters/tsne/{sample}_tsne_clusters.csv', sample=samples),
        expand('results/neo4j/{sample}/{file}', sample=samples,
               file=['cells.csv', 'genes.csv', 'cl_contains.csv', 'cl_isin.csv', 'cl_nodes.csv', 'expr_by.csv', 'expr_ess.csv'])

and the neo4j rule looks like this:

rule neo4j:
    input:
        script = 'python/neo4j.py',
        path_to_cl = 'results/clusters/umap/{sample}_umap_clusters.csv',
        path_to_umap = 'results/umap/{sample}_umap.csv',
        path_to_mtx = 'data_files/normalized/{sample}.csv',
        base_neo4j = 'results/neo4j/{sample}'
    output: 'results/neo4j/{sample}/cells.csv', 'results/neo4j/{sample}/genes.csv', 'results/neo4j/{sample}/cl_nodes.csv',
            'results/neo4j/{sample}/cl_contains.csv', 'results/neo4j/{sample}/expr_by.csv', 'results/neo4j/{sample}/expr_ess.csv',
            'results/neo4j/{sample}/cl_isin.csv'
    shell:
        "python {input.script} -path_to_cl {input.path_to_cl} -path_to_umap {input.path_to_umap} -path_to_mtx {input.path_to_mtx} -base_neo4j {input.base_neo4j}"

With this setup I get the error:

Missing input files for rule neo4j: results/neo4j/plus_1

Update

I removed this line from the neo4j rule: base_neo4j = 'results/neo4j/{sample}' and then changed the output of the rule to:

output:
    cells = 'results/neo4j/{sample}/cells.csv',
    genes = 'results/neo4j/{sample}/genes.csv',
    cl_nodes = 'results/neo4j/{sample}/cl_nodes.csv',
    cl_contains = 'results/neo4j/{sample}/cl_contains.csv',
    cl_isin = 'results/neo4j/{sample}/cl_isin.csv',
    expr_by = 'results/neo4j/{sample}/expr_by.csv',
    expr_ess = 'results/neo4j/{sample}/expr_ess.csv'

and the shell command:

shell:
    "python {input.script} -path_to_cl {input.path_to_cl} -path_to_umap {input.path_to_umap} -path_to_mtx {input.path_to_mtx} -cells {output.cells} -genes {output.genes} -cl_nodes {output.cl_nodes} -cl_contains {output.cl_contains} -cl_isin {output.cl_isin} -expr_by {output.expr_by} -expr_ess {output.expr_ess}"

I do not like passing each output to the script as a separate parameter, but it does not work otherwise. I tried passing just output, but only the first item in the output is passed on; the others are ignored for some reason. I asked a separate question about that:

Snakemake passes only the first path in the output to shell command

Other than that, it is working now.


Solution

  • It's not very easy to diagnose the full issue since you haven't provided the whole Snakefile, but here is what I can infer from what you specified:

    The error message is unfortunately a bit misleading, but the gist of it is that snakemake starts from a list of targets. These targets are either files you specified through the command line, or files that are the input of the topmost rule of a snakefile. Usually this rule is named "all" or "main". Here you would specify the final list of files to be generated by default. An example for your case would be:

    rule all:
        # the file names already end in .csv, so the pattern must not add the extension again
        input: expand('results/neo4j/{sample}/{file}', sample=samples, file=['cells.csv', 'genes.csv', ...])
    
    rule neo4j:
        ...
        output: 'results/neo4j/{sample}/cells.csv', 'results/neo4j/{sample}/genes.csv'...
    

    Snakemake looks at the input of all to figure out all the files to be generated, then figures out which rule(s) to run (neo4j) and with which wildcard values in order to generate them, then which rules generate the inputs of those rules, and so on. So the "target rule" all, the final rule in the dependency chain, is where everything starts, which is why you can't use wildcards there.

    Notice that the outputs of neo4j are patterns containing wildcards (they have {} in them and describe a hypothetical pattern that may match a file), whereas the input of all is expanded to concrete file names (like 'results/neo4j/123/cells.csv').
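To make the pattern versus concrete-name distinction tangible, here is a minimal plain-Python sketch of what an expand-style call produces. `expand_sketch` is a hypothetical stand-in written for illustration, not Snakemake's own implementation; the sample names are taken from the error messages in the question, and the file list is shortened.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values.

    Simplified stand-in for an expand-style helper (an assumption for
    illustration, not Snakemake's real code).
    """
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


samples = ["plus_1", "minus_2"]       # sample names seen in the error messages
files = ["cells.csv", "genes.csv"]    # shortened file list for illustration

# Every wildcard is replaced, so the result is a list of concrete paths.
targets = expand_sketch("results/neo4j/{sample}/{file}", sample=samples, file=files)
for path in targets:
    print(path)

# If the values already carry ".csv" and the pattern adds it again,
# the extension is doubled.
doubled = expand_sketch(
    "results/neo4j/{sample}/{file}.csv", sample=samples[:1], file=files[:1]
)
print(doubled)
```

The second call shows one way a doubled `.csv.csv` suffix can arise: the extension is present both in the pattern and in the values substituted into it.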

    Often people get this error because they don't have an all rule at the top of their Snakefile, which leads snakemake to pick whatever other rule is at the top as the target, and that rule happens to have a wildcard.

    You probably shouldn't need dynamic / directory / etc for something like this.
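The resolution step described in this answer (start from a concrete target, find a rule whose output pattern matches it, then fill in that rule's inputs) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not how Snakemake is actually implemented: `pattern_to_regex` and `infer_wildcards` are hypothetical helpers, each wildcard is assumed to appear only once per pattern, and each wildcard is assumed to match any non-empty string.

```python
import re


def pattern_to_regex(pattern):
    """Turn 'a/{sample}/b.csv' into a regex with one named group per wildcard.

    Assumes each wildcard name appears only once in the pattern.
    """
    # re.split with a capturing group alternates: literal, name, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)       # literal text, matched verbatim
        else:
            regex += f"(?P<{part}>.+)"     # wildcard: any non-empty string
    return re.compile(regex + "$")


def infer_wildcards(target, output_patterns):
    """Return the wildcard values that make some pattern equal to `target`."""
    for pattern in output_patterns:
        match = pattern_to_regex(pattern).match(target)
        if match:
            return match.groupdict()
    return None  # no pattern can produce this target


# Output patterns of the neo4j rule (shortened list).
neo4j_outputs = [
    "results/neo4j/{sample}/cells.csv",
    "results/neo4j/{sample}/genes.csv",
]

# A concrete target, such as one requested by rule all.
wildcards = infer_wildcards("results/neo4j/plus_1/cells.csv", neo4j_outputs)
print(wildcards)

# With the wildcard values known, the rule's input pattern becomes concrete too.
input_pattern = "data_files/normalized/{sample}.csv"
needed_input = input_pattern.format(**wildcards)
print(needed_input)

# A target that no output pattern matches cannot be produced by this rule.
unmatched = infer_wildcards("results/other/plus_1/cells.csv", neo4j_outputs)
print(unmatched)
```

The point of the sketch is the direction of inference: wildcard values are never chosen by the rule itself, they are read off a concrete file name that something downstream asked for. A rule whose outputs contain wildcards therefore has nothing to infer from when it is requested by name alone.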