I am new to Snakemake and am finding it difficult to do even the simplest things with it. For illustration, I have written a program, adding_text.py, that takes three arguments (via argparse): an input directory, an output directory and an index (into the os.listdir listing of the input directory). It uses them to process some text files.
This is my file structure:
identity_category1
|----A.txt -> text A identity
|----B.txt -> text B identity
|----C.txt -> text C identity
identity_category2
|----P.txt -> text P identity
|----Q.txt -> text Q identity
|----R.txt -> text R identity
identity_category3
|----X.txt -> text X identity
|----Y.txt -> text Y identity
|----Z.txt -> text Z identity
And this is my code, adding_text.py:
import argparse
import os
my_parser = argparse.ArgumentParser(usage='python %(prog)s [-h] input_dir output_dir file_index')
my_parser.add_argument('input_dir', type=str)
my_parser.add_argument('output_dir', type=str)
my_parser.add_argument('file_index', type=int)
args = my_parser.parse_args()
input_dir = args.input_dir
output_dir = args.output_dir
file_index = args.file_index
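# Create the output directory if it does not exist yet
if not os.path.exists(output_dir):
    os.mkdir(output_dir)
# Pick the file by its position in the directory listing
filelist = os.listdir(input_dir)
input_path = os.path.join(input_dir, filelist[file_index])
output_path = os.path.join(output_dir, filelist[file_index].split('.')[0] + '_added.txt')
# Context managers make sure both files are closed (and the output is flushed)
with open(input_path, 'r') as input_file, open(output_path, 'w') as output_file:
    output_file.write(input_file.read() + ' has been added\n')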
All I am doing is firing the following commands at console:
python adding_text.py identity_category1 1_added 0
python adding_text.py identity_category1 1_added 1
python adding_text.py identity_category1 1_added 2
python adding_text.py identity_category2 2_added 0
python adding_text.py identity_category2 2_added 1
python adding_text.py identity_category2 2_added 2
python adding_text.py identity_category3 3_added 0
python adding_text.py identity_category3 3_added 1
python adding_text.py identity_category3 3_added 2
And get the following output (structure):
1_added
|----A_added.txt -> text A identity has been added
|----B_added.txt -> text B identity has been added
|----C_added.txt -> text C identity has been added
2_added
|----P_added.txt -> text P identity has been added
|----Q_added.txt -> text Q identity has been added
|----R_added.txt -> text R identity has been added
3_added
|----X_added.txt -> text X identity has been added
|----Y_added.txt -> text Y identity has been added
|----Z_added.txt -> text Z identity has been added
So the Python code isn't the problem. The problem comes when I try to design a Snakemake workflow around it, involving multiple wildcards, dependencies etc. My possible_snakefile looks like this:
NUM = ["1", "2", "3"]
SAMPLE = ["A", "B", "C"]

rule add_text:
    input:
        expand("identity_category{num}/{sample}.txt", num=NUM, sample=SAMPLE)
    output:
        expand("{num}_added/{sample}_added.txt", num=NUM, sample=SAMPLE)
    run:
        for index in range(0,3):
            shell("python adding_text.py identity_category{num} {num}_added index")
When I try to specify a target and perform a dry run via snakemake --cores 1 -n -s possible_snakefile 1_added/A_added.txt, it incorrectly maps input directories and their files and throws this error:
MissingInputException in line 4 possible_snakefile:
Missing input files for rule add_text:
identity_category3/C.txt
identity_category2/A.txt
identity_category3/B.txt
identity_category2/B.txt
identity_category2/C.txt
identity_category3/A.txt
I am sure it's very simple, but I am just not able to get my head around it, i.e. whether I need a different wildcard specification in possible_snakefile or different target files at the command line. I would appreciate help here. Thank you.
First of all, your design is not very good, as it relies on the order of filenames. That means that if you add one more file to the identity_category{num} directory, the result may change. That complicates the pipeline and makes it less predictable, so I'd advise you to rework the script and make the dependencies explicit. Anyway, in the rest of my answer I will assume that the script is something you cannot change.
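As a minimal sketch of what "explicit dependencies" could look like, the script might take one input file and one output file instead of a directory and an index. The function name add_text and the temporary paths below are hypothetical and only serve the illustration; they are not part of the original script.

```python
import os
import tempfile


def add_text(input_path, output_path):
    """Copy input_path to output_path, appending ' has been added'."""
    out_dir = os.path.dirname(output_path)
    if out_dir:
        # Create the output directory on demand.
        os.makedirs(out_dir, exist_ok=True)
    with open(input_path) as src, open(output_path, "w") as dst:
        dst.write(src.read() + " has been added\n")


# Usage on a throwaway directory (all paths are illustrative only).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "identity_category1", "A.txt")
dst_path = os.path.join(workdir, "1_added", "A_added.txt")
os.makedirs(os.path.dirname(src_path))
with open(src_path, "w") as fh:
    fh.write("text A identity")

add_text(src_path, dst_path)
with open(dst_path) as fh:
    result = fh.read()
print(result)
```

With a script shaped like this, each output file depends on exactly one named input file, so adding another file to the directory does not change what the existing ones produce.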
You need to specify a target: the file (or a group of files or directories) that the pipeline shall produce. This target must contain no wildcards, as it has to be explicit. With your script it is not so obvious what the target is, but you may specify the group of {num}_added directories that you plan to get from the pipeline:
rule target:
    input:
        expand("{num}_added", num=NUM)
Note that the {num} here is not a wildcard, as it is fully resolved by the expand function. Actually this function returns a list of three elements, ["1_added", "2_added", "3_added"], so Snakemake knows what to produce:
rule target:
    input:
        ["1_added", "2_added", "3_added"]
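To see what expand does with its arguments, here is a rough plain-Python emulation. The helper expand_sketch is hypothetical and is not Snakemake's implementation; it only assumes that every combination of the supplied values is filled into the pattern, and the ordering of the real function may differ.

```python
from itertools import product


def expand_sketch(pattern, **values):
    """Fill pattern with every combination of the given values."""
    names = list(values)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*(values[name] for name in names))
    ]


NUM = ["1", "2", "3"]
SAMPLE = ["A", "B", "C"]

# One list of values: one result per value.
targets = expand_sketch("{num}_added", num=NUM)
print(targets)

# Two lists of values: every num is combined with every sample.
crossed = expand_sketch("identity_category{num}/{sample}.txt", num=NUM, sample=SAMPLE)
print(len(crossed))
print("identity_category2/A.txt" in crossed)
```

The second call shows why the Snakefile in the question asks for files such as identity_category2/A.txt: with two lists, all nine combinations are requested, although only three of them exist on disk.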
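In addition, note that the name target is arbitrary, but this has to be the topmost rule in your Snakefile, because Snakemake uses the first rule as the default target when none is given on the command line.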
Ok, now Snakemake knows that it needs to produce 3 objects, and you can instruct it how to produce each of them:
rule make_added:
    input:
        "identity_category{num}"
    output:
        "{num}_added"
    ...
    # some magic would come here later
This rule tells Snakemake that to produce a single {num}_added directory it needs another directory, identity_category{num}, where the {num} has to match. The {num} here is a wildcard: Snakemake substitutes its value automatically and runs this rule 3 times (actually len(NUM) times).
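The matching step can be pictured roughly as follows. This is a simplified plain-Python sketch, not Snakemake's code: the helpers pattern_to_regex and resolve are hypothetical, and the sketch assumes that a wildcard matches any non-empty string.

```python
import re


def pattern_to_regex(pattern):
    """Turn 'a{name}b' into a compiled regex with one named group per wildcard."""
    # re.split with a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for position, part in enumerate(parts):
        if position % 2:
            regex += f"(?P<{part}>.+)"  # wildcard name
        else:
            regex += re.escape(part)    # literal text
    return re.compile(regex)


def resolve(target, output_pattern="{num}_added", input_pattern="identity_category{num}"):
    """Return (wildcards, input) for a requested target, or None if it does not match."""
    match = pattern_to_regex(output_pattern).fullmatch(target)
    if match is None:
        return None
    wildcards = match.groupdict()
    return wildcards, input_pattern.format(**wildcards)


for target in ["1_added", "2_added", "3_added"]:
    print(target, "->", resolve(target))
```

Each requested target fixes the value of num, and the same value is then filled into the input pattern, which is why the rule runs once per target.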
Now let's call your script:
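rule make_added:
    input:
        "identity_category{num}"
    output:
        # Snakemake 5.2 and later expect a directory output to be
        # flagged, i.e. written as directory("{num}_added")
        "{num}_added"
    run:
        for index in range(0, 3):
            shell("python adding_text.py identity_category{wildcards.num} {wildcards.num}_added {index}")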
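Inside shell() you cannot refer to the wildcard by its bare name ({num}); it has to be written as {wildcards.num}. Moreover, the variable index has to be put into braces ({index}) so that its value, not the literal word index, ends up in the command.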
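The substitution can be mimicked with Python's own str.format, as a rough sketch. This is only an analogy for how the braces are filled in, not what shell() does internally; the SimpleNamespace object stands in for Snakemake's wildcards object and is an assumption of the example.

```python
from types import SimpleNamespace

# Stand-in for the wildcards object Snakemake provides inside a rule.
wildcards = SimpleNamespace(num="1")

template = "python adding_text.py identity_category{wildcards.num} {wildcards.num}_added {index}"
commands = [template.format(wildcards=wildcards, index=index) for index in range(0, 3)]
for command in commands:
    print(command)

# Without braces, the word "index" is passed through literally.
unbraced = "python adding_text.py identity_category{wildcards.num} {wildcards.num}_added index"
literal = unbraced.format(wildcards=wildcards, index=0)
print(literal)

# A bare {num} cannot be resolved, because only "wildcards" and "index" are known.
try:
    "identity_category{num}".format(wildcards=wildcards, index=0)
    bare_name_failed = False
except KeyError:
    bare_name_failed = True
print(bare_name_failed)
```

The three generated commands correspond to the first three commands typed by hand in the question.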