Tags: python, pipeline, snakemake

Snakemake hangs when trying to expand a large number of files


I'm trying to use Snakemake to read in a large English corpus of .txt files and run a Python script on each of them, but it seems to hang completely when I run it. I've left it running for quite a while with no response, even though the script itself takes only a small amount of time to run.

Here is my current Snakefile:

raw_dirs, raw_files = glob_wildcards("../../my_data/{dir}/{id}.txt")


rule p_tag:
    input:
        protected(expand("../../my_data/{dir}/{id}.txt", dir = raw_dirs, id = raw_files))
    output:
        expand("../../my_data/tagged/{dir}/{id}.txt", dir = raw_dirs, id = raw_files)
    script:
        "ml/pos_tag.py"

Solution

  • You probably don't want a straight expand here, as that will produce the Cartesian product of the dir and id lists rather than only the pairs that exist on disk. Pass zip as the second argument to expand to generate just the dir/id pairs that were globbed.
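
    To illustrate with a tiny made-up pair of lists (d1/d2 and a/b stand in for your real wildcard values):

    # plain expand: full product of the lists -> 4 paths
    expand("../../my_data/{dir}/{id}.txt", dir=["d1", "d2"], id=["a", "b"])
    # ['../../my_data/d1/a.txt', '../../my_data/d1/b.txt',
    #  '../../my_data/d2/a.txt', '../../my_data/d2/b.txt']

    # expand with zip: element-wise pairing -> 2 paths
    expand("../../my_data/{dir}/{id}.txt", zip, dir=["d1", "d2"], id=["a", "b"])
    # ['../../my_data/d1/a.txt', '../../my_data/d2/b.txt']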

    If it's hanging on the globbing itself, you can also add wildcard constraints to help the regex engine; by default each wildcard matches the greedy pattern .+, which can match across directory separators.
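
    A sketch of one way to do that, constraining each wildcard inline to a single path component (this assumes the {name,regex} constraint syntax; a wildcard_constraints block works too):

    raw_dirs, raw_files = glob_wildcards(
        "../../my_data/{dir,[^/]+}/{id,[^/]+}.txt")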

    Finally, I'm not sure what your script is doing, but it may help to have your rule handle one file at a time instead of taking all inputs/outputs.

    Edit to expand on the final point: Your current rule is taking all inputs and all outputs and providing them to the script. Let's say ml/pos_tag.py does something like:

    for infile, outfile in zip(snakemake.input, snakemake.output):
        # do work on infile and store in outfile
    

    Change that script to work on a single infile to produce an outfile. (This assumes the files are independent; if you actually need all the input files to produce your outputs, this isn't right.)

    # do work on snakemake.input[0] and store in snakemake.output[0]
    
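    As a concrete sketch, a single-file ml/pos_tag.py could look like this (the nltk tagger is an assumed stand-in for whatever tagging code you already have):

    import nltk  # assumption: nltk plus its tokenizer/tagger models are installed

    # Snakemake injects the `snakemake` object into scripts run via `script:`.
    with open(snakemake.input[0]) as fin:
        tokens = nltk.word_tokenize(fin.read())

    with open(snakemake.output[0], "w") as fout:
        for word, tag in nltk.pos_tag(tokens):
            fout.write(f"{word}\t{tag}\n")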

    Then your snakefile becomes:

    raw_dirs, raw_files = glob_wildcards("../../my_data/{dir}/{id}.txt")
    
    rule all:
        input:
            expand("../../my_data_tagged/{dir}/{id}.txt",
                   zip, dir=raw_dirs, id=raw_files)
    
    rule p_tag:
        input:
            "../../my_data/{dir}/{id}.txt"
        output:
            "../../my_data_tagged/{dir}/{id}.txt"
        script:
            "ml/pos_tag.py"
    

    The main advantage is that you can have Snakemake parallelize the work across files instead of doing the loop in Python.
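
    For example, an invocation like this (adjust the core count to your machine) lets Snakemake run up to 8 p_tag jobs at once:

    snakemake --cores 8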

    Beyond splitting the rule, I added zip into the expand in all and removed the protected marking, which is only valid on outputs. Finally, I stored the outputs in a new directory (my_data_tagged instead of my_data/tagged), because otherwise subsequent runs would glob your outputs as inputs; the unconstrained {dir} wildcard happily matches across the slash:

    "../../my_data/tagged/d1/id10.txt"
    #              ^  dir  ^ ^id^