python automation pipeline bioinformatics snakemake

MD task with `snakemake`

I want to create a very simple pipeline for molecular dynamic simulation. The program (Amber) just wants 3 files as input, and produces a lot of files, some of them I will be needed in the future. So my pipeline is extremely simple:

Check that *.in, *.prmtop and *.rst are in folder (I guarantee it's only one file for any of these extensions) and warn if these files are not present
Run shell command (based on name of all input files)
Check that *.out, mden, mdinfo and *.nc were produced

That's all. It's standard approach to the program I deal with. One folder, one task, short and simple file names based on file purpose, not on its content.

I wrote a simple pipline:

rule all:
    input: '{inp}.out'

rule amber:
    input:
        '{inp}.in',
        '{top}.prmtop',
        '{coord}.rst'
    output:
        '{inp}.out',
        'mden',
        'mdinfo',
        '{inp}.nc'
    shell:
        'pmemd.cuda'
        ' -O'
        ' -i {inp}.in'
        ' -o {inp}.out'
        ' -p {top}.prmtop'
        ' -c {coord}.rst'
        ' -r {inp}.rst'
        ' -x {inp}.nc'
        ' -ref {coord}.rst'

And it doesn't work.

All inputs in all rule must be explicit. (Why? Why it cannot be regex or wildcard expression? If I see *.out in folder and status code of shell script was 0, that's all, work is done)
I must to use all wildcards from input in output, but I want to use some only in shell or in another rules
I must to not expect to get files like mden with potentially "non-unique" names because it's could be change with another task, but I know, that it will be only one task and it's a direct way how my MD program works (yeah, I know about Ambers's -e and -inf keys, but it's over-complication of simple task).

So, I would like to decide is it worth using snakemake for this, or not. It's very simple task, but I already spent several hours, I see a lot of documentation, a lot of examples, that I can't apply to my case. snakemake looks exactly what I need, but I can't express simple task in general terms with this framework, I don't want to specify explicit filenames, because I'll lose in flexibility, I want to run hundreds of simple tasks automatically, only input files will be different. I'm sure I just haven't figured out how to handle this framework yet. Maybe you can show me how should I? Thank you!

Solution

Hopefully this will put you in the right direction.

If I understand correctly, the input to snakemake is a folder containing the input files to amber. You know that this folder contains one .in file, one .prmtop file, and one .rst file but you don't know the full names of these files.

If you want snakemake to run on a single input folder, then you don't need wildcards at all and the script below should do.

import glob
import os

input_folder = config['amber_folder']

# We don't know the name of input file. We only know it ends in '.in'
inp = glob.glob(os.path.join(input_folder, '*.in'))
assert len(inp) == 1
inp = inp[0]

name = os.path.splitext(os.path.basename(inp))[0]
output_folder = name + '_results'

out = os.path.join(output_folder, name + '.out')

rule all:
    input:
        out

rule amber:
    input:
        inp= inp,
        top= glob.glob(os.path.join(input_folder, '*.prmtop')),
        rst= glob.glob(os.path.join(input_folder, '*.rst')),
    output:
        out= out,
        nc= os.path.join(output_folder, name + '.nc'),
        mden= os.path.join(output_folder, 'mden'),
        mdinfo= os.path.join(output_folder, 'mdinfo'),
    shell:
        r"""
        pmemd.cuda \
         -O \
         -i {input.inp} \
         -o {output.out} \
         -p {input.top} \
         -c {input.rst} \
         -r {input.rst} \
         -x {output.nc} \
         -ref {input.rst}
        """

Execute with:

snakemake -j 1 -C amber_folder='your-input-folder'

If you have many input folders you could write a for-loop to execute the command above but probably better is to pass the list of inputs to snakemake and let it handle them.