
Snakemake with 3 different programs, 2 input directories and one pre-defined .txt file with desired outputs


This will be a bit of a demanding question, but staring hard at the Snakemake docs so far has not yielded the desired result, and I hope someone more experienced can walk me through it.

Let me describe what I have:

1: Three Programs on the Shell, which execute like this:

./program_1 $(cat file/from/dir_1/foo.fa) $(cat file/from/dir_2/bar.fa) > output.txt
./program_2 $(cat file/from/dir_1/foo.fa) $(cat file/from/dir_2/bar.fa) > output.txt
./program_3 $(cat file/from/dir_1/foo.fa) $(cat file/from/dir_2/bar.fa) > output.txt

Yes, the cat is necessary: the programs use strings from within the files, not the files themselves.


2: Two directories that look something like this:

dir_1:

hsa-let-7f-5p.fa
hsa-let-7g-5p.fa
hsa-miR-100-3p.fa
hsa-miR-100-5p.fa
hsa-miR-101-2-5p.fa

...(more)

dir_2:

NM_000044.fa
NM_000059.fa
NM_000075.fa
NM_000088.fa
NM_000103.fa

...(more)


3: A .txt file with all desired output combinations:

hsa-miR-29b-3p__NM_138473_program_1.txt
hsa-miR-29b-3p__NM_138473_program_2.txt
hsa-miR-29b-3p__NM_138473_program_3.txt
hsa-miR-545-3p__NM_002332_program_1.txt
hsa-miR-545-3p__NM_002332_program_2.txt
hsa-miR-545-3p__NM_002332_program_3.txt

...(more)


Not all files from the two directories are used, and some are used multiple times. I don't want all combinations, just the ones specified. The output files should be separate and named according to the .txt specified above. The final Snakefile should parallelize nicely on a cluster.


The workflow is pretty simple in words:

1. Read the output file names (see the small parsing sketch below).
2. Retrieve the required inputs for each combination from the two directories.
3. Run each combination through all 3 programs and write the pre-defined .txt files by piping the program stdout into a file.

Done
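
To make steps 1 and 2 concrete: a minimal parsing sketch, assuming the desired names follow exactly the <hsa>__<nm>_program_<k>.txt scheme and live in snakemake_output_small.txt, could look like this:

import re

pattern = re.compile(r'(?P<hsa>.+)__(?P<nm>.+)_(?P<program>program_\d+)\.txt')

with open('snakemake_output_small.txt') as fh:
    for line in fh:
        m = pattern.fullmatch(line.strip())
        if m:
            print(m.group('hsa'), m.group('nm'), m.group('program'))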


But well ... how do I tell that to Snakemake? I was told this is conveniently possible, but no luck so far. If there are questions, please ask. Thank you in advance (:



I adjusted the code below to the best of my knowledge for the situation. I'm reading the output names from the file now, and the "programs" at the top and the "dirs" in "rule one" have their full paths. I also switched the two inputs in the shell command because I was confused; they go this way. Yeah, I know the file names are suboptimal.

import re

out = []
with open("snakemake_output_small.txt") as f:
    for line in f:
        out.append(line.rstrip('\n'))

# Get distinct filenames
hsa = set([x.split('__')[0] for x in out])
nm = [x.split('__')[1] for x in out]
nm = set([re.sub('_program_.*', '', x) for x in nm])
program = ['full/path/to/program_1', 'full/path/to/program_2', 'full/path/to/program_3']

# Force snakemake to use exactly these wildcard values
# i.e. do not match by regex
wildcard_constraints:
    hsa= '|'.join([re.escape(x) for x in hsa]),
    nm= '|'.join([re.escape(x) for x in nm]),
    program= '|'.join([re.escape(x) for x in program]),


rule all:
    input:
        out,


rule one:
    input:
        hsa= 'full/path/to/dir1/{hsa}.fa',
        nm= 'full/path/to/dir2/{nm}.fa',
    output:
        '{hsa}__{nm}_{program}.txt',
    shell:
        r"""
        {wildcards.program} {input.nm} {input.hsa} > {output}
        """

Now it is telling me:

Building DAG of jobs...
MissingInputException in line 21 of Snakefile:
Missing input files for rule all:
hsa-miR-545-3p__NM_002332_program_3.txt
hsa-miR-29b-3p__NM_138473_program_1.txt
hsa-miR-29b-3p__NM_138473_program_1.txt
hsa-miR-545-3p__NM_002332_program_2.txt
hsa-miR-29b-3p__NM_138473_program_3.txt
hsa-miR-545-3p__NM_002332_program_2.txt

This is 100% a user error, but what am I missing? The input dirs are correct, if I'm not totally blind.


I also forgot to mention that two of the programs have their own parameters that have to be included: a -d for program 2 and a -P here/a/parameter/file for program 3, so the shell commands might need to be separated. Sorry, this is all very messy.

Like before, if there are questions, please ask.


Solution

  • I would prefer to put the output filenames, together with the combinations of inputs and programs, in a csv file, read it with pandas, and use the dataframe to guide the workflow. Here instead I parse the output filenames to extract the relevant inputs, which is murky in my opinion.

    As an aside, the MissingInputException comes from constraining the program wildcard to the full program paths: the requested output names only contain program_1 etc., so the output pattern of rule one can never match them. Keep the wildcard as the plain program name and move the full path into params.

    To accommodate options passed to each program, use a dictionary or, better, a yaml configuration file, and query this dictionary using the program wildcard:

    import re

    # Read this list from file:
    out= ['hsa-miR-29b-3p__NM_138473_program_1.txt',
        'hsa-miR-29b-3p__NM_138473_program_2.txt',
        'hsa-miR-29b-3p__NM_138473_program_3.txt',
        'hsa-miR-545-3p__NM_002332_program_1.txt',
        'hsa-miR-545-3p__NM_002332_program_2.txt',
        'hsa-miR-545-3p__NM_002332_program_3.txt']
    
    hsa = set([x.split('__')[0] for x in out])
    nm = [x.split('__')[1] for x in out]
    nm = set([re.sub('_program_.*', '', x) for x in nm])
    
    # Better to put this in a yaml file and read it with `--configfile progs.yaml`
    programs = {'program_1': 
                    {'path': '/path/to/program_1', 'opts': '-d foo -P bar'}, 
               'program_2': 
                    {'path': '/path/to/program_2', 'opts': '-P spam eggs'}, 
               'program_3': 
                    {'path': '/path/to/program_3', 'opts': ''}
               }
    
    wildcard_constraints:
        hsa= '|'.join([re.escape(x) for x in hsa]),
        nm= '|'.join([re.escape(x) for x in nm]),
        program= '|'.join([re.escape(x) for x in programs.keys()]),
    
    
    rule all:
        input:
            out,
    
    
    rule one:
        input:
            hsa= 'dir1/{hsa}.fa',
            nm= 'dir2/{nm}.fa',
        output:
            '{hsa}__{nm}_{program}.txt',
        params:
            path= lambda wc: programs[wc.program]['path'],
            opts= lambda wc: programs[wc.program]['opts'],
        shell:
            r"""
            {params.path} {params.opts} $(cat {input.hsa}) $(cat {input.nm}) > {output}
            """