Snakemake multiple wildcards and argparse arguments

I am new to Snakemake and am finding it difficult to do even the simplest things with it. For illustration, I have written a program, adding_text.py, that takes three arguments (via argparse): an input directory, an output directory, and an index into the os.listdir result for the input directory. It uses them to process some text files.

This is my file structure:

identity_category1  
|----A.txt -> text A identity  
|----B.txt -> text B identity  
|----C.txt -> text C identity  
identity_category2  
|----P.txt -> text P identity  
|----Q.txt -> text Q identity  
|----R.txt -> text R identity  
identity_category3  
|----X.txt -> text X identity  
|----Y.txt -> text Y identity  
|----Z.txt -> text Z identity

And this is my code adding_text.py:

import argparse
import os
my_parser = argparse.ArgumentParser(usage='python %(prog)s [-h] input_dir output_dir file_index')
my_parser.add_argument('input_dir', type=str)
my_parser.add_argument('output_dir', type=str)
my_parser.add_argument('file_index', type=int)
args = my_parser.parse_args()

input_dir = args.input_dir
output_dir = args.output_dir
file_index = args.file_index
if not os.path.exists(output_dir):
    os.mkdir(output_dir)

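filelist = os.listdir(input_dir)  # note: the listing order is not guaranteed
input_name = filelist[file_index]
output_name = input_name.split('.')[0] + '_added.txt'
# Context managers close both files even if an error occurs
with open(os.path.join(input_dir, input_name), 'r') as input_file, \
        open(os.path.join(output_dir, output_name), 'w') as output_file:
    output_file.write(input_file.read() + ' has been added\n')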

I run the following commands at the console:

python adding_text.py identity_category1 1_added 0
python adding_text.py identity_category1 1_added 1
python adding_text.py identity_category1 1_added 2
python adding_text.py identity_category2 2_added 0
python adding_text.py identity_category2 2_added 1
python adding_text.py identity_category2 2_added 2
python adding_text.py identity_category3 3_added 0
python adding_text.py identity_category3 3_added 1
python adding_text.py identity_category3 3_added 2

And get the following output (structure):

1_added
|----A_added.txt -> text A identity has been added
|----B_added.txt -> text B identity has been added
|----C_added.txt -> text C identity has been added
2_added
|----P_added.txt -> text P identity has been added
|----Q_added.txt -> text Q identity has been added
|----R_added.txt -> text R identity has been added
3_added
|----X_added.txt -> text X identity has been added
|----Y_added.txt -> text Y identity has been added
|----Z_added.txt -> text Z identity has been added

So the Python code isn't the problem. The problem arises when I try to design a Snakemake workflow around it, involving multiple wildcards, dependencies, etc. My possible_snakefile looks like this:

NUM = ["1", "2", "3"]
SAMPLE = ["A", "B", "C"]

rule add_text:
    input: 
        expand("identity_category{num}/{sample}.txt", num=NUM, sample=SAMPLE)
    output: 
        expand("{num}_added/{sample}_added.txt", num=NUM, sample=SAMPLE)
    run:
        for index in range(0,3):
            shell("python adding_text.py identity_category{num} {num}_added index")

When I specify a target and perform a dry run via snakemake --cores 1 -n -s possible_snakefile 1_added/A_added.txt, it maps the input directories to the wrong files and throws this error:

MissingInputException in line 4 possible_snakefile:
Missing input files for rule add_text:
identity_category3/C.txt
identity_category2/A.txt
identity_category3/B.txt
identity_category2/B.txt
identity_category2/C.txt
identity_category3/A.txt

I am sure it's very simple, but I just can't get my head around it: should I specify the wildcards differently in possible_snakefile, or specify different target files at the command line? I would appreciate any help. Thank you.

Solution

First of all, your design is fragile because it relies on the order of filenames returned by os.listdir. If you add one more file to the identity_category{num} directory, the result may change. That complicates the pipeline and makes it less predictable, so I'd advise you to rework the script and make the dependencies explicit. In the rest of my answer, however, I will assume that the script is something you cannot change.
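If the script can be changed after all, one way to make the dependency explicit is to pass the input and output file paths directly instead of a directory and an index. The following is only a minimal sketch: the function name add_text is hypothetical, and the appended ' has been added' suffix is copied from the original script.

```python
import os
import tempfile


def add_text(input_path, output_path):
    """Copy input_path to output_path, appending ' has been added'."""
    out_dir = os.path.dirname(output_path)
    if out_dir:
        # Create the output directory if needed; no error if it already exists.
        os.makedirs(out_dir, exist_ok=True)
    with open(input_path) as src, open(output_path, "w") as dst:
        dst.write(src.read() + " has been added\n")


if __name__ == "__main__":
    # Small self-contained demonstration in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "identity_category1", "A.txt")
        dst = os.path.join(tmp, "1_added", "A_added.txt")
        os.makedirs(os.path.dirname(src))
        with open(src, "w") as handle:
            handle.write("text A identity")
        add_text(src, dst)
        with open(dst) as handle:
            print(handle.read(), end="")
```

With a script shaped like this, each output file depends on exactly one named input file, so the result no longer changes when another file appears in the directory.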

You need to specify a target: the file (or group of files or directories) that the pipeline should produce. The target must contain no wildcards, because it has to be explicit. With your script it is not obvious what the target is, but you can specify the group of {num}_added directories that you expect the pipeline to produce:

rule target:
    input:
        expand("{num}_added", num=NUM)

Note that {num} here is not a wildcard, because it is fully resolved by the expand function. This call returns a list of three elements, ["1_added", "2_added", "3_added"], so Snakemake knows what to produce. The rule is equivalent to:

rule target:
    input:
        ["1_added", "2_added", "3_added"]

Also note that the name target is arbitrary, but it has to be the first rule in your Snakefile, because Snakemake uses the first rule as the default target when none is given on the command line.
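To see why the rule in the question asked for files such as identity_category2/A.txt, it can help to emulate what expand does with several value lists: it fills the pattern with every combination of the values. The function expand_sketch below is a hypothetical plain-Python stand-in written for illustration, not Snakemake's own implementation.

```python
from itertools import product


def expand_sketch(pattern, **values):
    """Fill pattern with every combination of the given values (emulation only)."""
    names = list(values)
    combos = product(*(values[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


NUM = ["1", "2", "3"]
SAMPLE = ["A", "B", "C"]

# One value list: three fully resolved directory names.
print(expand_sketch("{num}_added", num=NUM))
# -> ['1_added', '2_added', '3_added']

# Two value lists: all 3 x 3 combinations, including files that do not exist.
paths = expand_sketch("identity_category{num}/{sample}.txt", num=NUM, sample=SAMPLE)
print(len(paths))                          # -> 9
print("identity_category2/A.txt" in paths)  # -> True
```

Because A, B and C exist only in identity_category1, six of the nine combinations point at missing files, which matches the six entries in the MissingInputException.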

OK, now Snakemake knows that it needs to produce three objects, and you can tell it how to produce each of them:

rule make_added:
    input:
        "identity_category{num}"
    output:
        # a directory used as an output has to be marked with directory()
        directory("{num}_added")
    ...
    # some magic would come here later

This rule tells Snakemake that to produce a single {num}_added directory it needs the directory identity_category{num} with the same value of {num}. Here {num} is a wildcard: Snakemake substitutes its value automatically and runs this rule three times (len(NUM) times, once per requested target).
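The matching step can be pictured as turning the output pattern into a regular expression and reading the wildcard value off the requested target. The helper match_wildcards below is a hypothetical, simplified emulation (each wildcard matches one or more characters, and a wildcard name may appear only once in the pattern); it is not how Snakemake is actually implemented.

```python
import re


def match_wildcards(pattern, target):
    """Return a dict of wildcard values if target fits pattern, else None."""
    regex = ""
    position = 0
    for found in re.finditer(r"\{(\w+)\}", pattern):
        # Literal text between wildcards must match exactly.
        regex += re.escape(pattern[position:found.start()])
        # Each wildcard becomes a named group matching one or more characters.
        regex += "(?P<{}>.+)".format(found.group(1))
        position = found.end()
    regex += re.escape(pattern[position:])
    result = re.fullmatch(regex, target)
    return result.groupdict() if result else None


for target in ["1_added", "2_added", "3_added"]:
    wildcards = match_wildcards("{num}_added", target)
    # The same value is then substituted into the input pattern.
    print(target, "<-", "identity_category{num}".format(**wildcards))
```

Each of the three targets yields its own value of num, which is why the rule runs once per target.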

Now let's call your script:

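rule make_added:
    input:
        "identity_category{num}"
    output:
        # a directory used as an output has to be marked with directory()
        directory("{num}_added")
    run:
        # call the script once per file index (three files per directory)
        for index in range(0, 3):
            shell("python adding_text.py identity_category{wildcards.num} {wildcards.num}_added {index}")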

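Note that inside a run block you cannot refer to a wildcard by its bare name ({num}); you have to write {wildcards.num}. You also need to put the variable index in braces, otherwise the literal text index is passed to the script.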
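The effect of the braces can be checked outside Snakemake with ordinary string formatting. In this sketch a SimpleNamespace stands in for Snakemake's wildcards object (an assumption made purely for illustration), and str.format plays the role of the substitution that shell performs.

```python
from types import SimpleNamespace

# Stand-in for the wildcards object Snakemake provides inside a run block.
wildcards = SimpleNamespace(num="1")

template = (
    "python adding_text.py identity_category{wildcards.num} "
    "{wildcards.num}_added {index}"
)

# With braces, index is substituted on every iteration.
commands = [template.format(wildcards=wildcards, index=index) for index in range(0, 3)]
for command in commands:
    print(command)

# Without braces, the word "index" is passed through literally.
unbraced = "python adding_text.py identity_category{wildcards.num} {wildcards.num}_added index"
print(unbraced.format(wildcards=wildcards))
```

The first three printed lines are the commands for identity_category1 with indices 0, 1 and 2; the last line ends in the literal word index, which argparse would reject because file_index is declared as an int.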