Search code examples
pythonshellglobsnakemake

Is the use of * allowed in Snakemake?


I need to concatenate certain files inside some directories that have been created using a wildcard in a Snakefile. I tried to create the following rule to concatenate all files inside these directories:

# concatenate output per hmm
rule concatenate:
    input:
        output_{hmm}/* ,
    output: 
        output_{hmm}/cat_{hmm}.txt,
    params:
        cmd='cat'
    shell: 
        '{params.cmd} {input} > {output} '

It doesn't work and produced the following error:

"SyntaxError in line 62 of /scratch/data1/agalvez/domains/Snakefile_ecdf:
invalid syntax (Snakefile_ecdf, line 62)"

I do not know what is wrong with the rule and I suppose that the use of * might not be adequate, but I cannot think of another way to do what I intend to.

EDIT: The question might be lacking some information so I will also attach the complete Snakefile:

ARCHIVE_FILE = 'output.tar.gz'

# a single output file
OUTPUT_FILE = 'output_{hmm}/{species}_{hmm}.out'

# a single input file
INPUT_FILE = 'proteins/{species}.fasta'

# a single hmm file
HMM_FILE = 'hmm/{hmm}.hmm'

# a single cat file
CAT_FILE = 'cat/cat_{hmm}.txt'

# Build the list of input files.
INP = glob_wildcards(INPUT_FILE).species

# Build the list of hmm files.
HMM = glob_wildcards(HMM_FILE).hmm

# The list of all output files
OUT = expand(OUTPUT_FILE, species=INP, hmm=HMM)

# The list of all CAT files
CAT = expand(CAT_FILE, hmm=HMM)

# pseudo-rule that tries to build everything.
# Just add all the final outputs that you want built.
rule all:
    input: ARCHIVE_FILE

# hmmsearch
rule hmm:
    input:
        species=INPUT_FILE ,
        hmm=HMM_FILE
    output: 
        OUTPUT_FILE,
    params:
        cmd='hmmsearch --noali -E 99 --tblout'
    shell: 
        '{params.cmd} {output} {input.hmm} {input.species} '

# concatenate output per hmm
from glob import glob

rule concatenate:
    input:
        files = glob("output_{hmm}/*") ,
    output: 
        CAT_FILE,
    params:
        cmd='cat'
    shell: 
        '{params.cmd} {input.files} {output} '

# create an archive with all results
rule create_archive:
    input: OUT, CAT,
    output: ARCHIVE_FILE
    shell: 'tar -czvf {output} {input}'


Solution

  • For testing purposes, let's create a sample file set (so operating with dummy files to make sure the workflow works). In the terminal, I ran:

    mkdir proteins && touch proteins/1.fasta proteins/2.fasta
    mkdir hmm && touch hmm/A.hmm hmm/B.hmm
    

    Now, your workflow is mostly correct, except for the rule concatenate. The inputs for this rule are created by rule hmm, and the output of this rule is specific to wildcard value of hmm. So you are interested in all values for species for a given hmm. The way to get that is to use expand but keep hmm in wildcard format, using expand(OUTPUT_FILE, species=INP, hmm="{hmm}"):

    rule concatenate:
        input:
            expand(OUTPUT_FILE, species=INP, hmm="{hmm}"),
        output:
            CAT_FILE,
        params:
            cmd="cat",
        shell:
            "{params.cmd} {input} > {output} "
    

    In the workflow below I modified rule hmm for quick test run, so the complete workflow will look like this:

    ARCHIVE_FILE = "output.tar.gz"
    
    # a single output file
    OUTPUT_FILE = "output_{hmm}/{species}_{hmm}.out"
    
    # a single input file
    INPUT_FILE = "proteins/{species}.fasta"
    
    # a single hmm file
    HMM_FILE = "hmm/{hmm}.hmm"
    
    # a single cat file
    CAT_FILE = "cat/cat_{hmm}.txt"
    
    # Build the list of input files.
    INP = glob_wildcards(INPUT_FILE).species
    
    # Build the list of hmm files.
    HMM = glob_wildcards(HMM_FILE).hmm
    
    # The list of all output files
    OUT = expand(OUTPUT_FILE, species=INP, hmm=HMM)
    
    # The list of all CAT files
    CAT = expand(CAT_FILE, hmm=HMM)
    
    
    # pseudo-rule that tries to build everything.
    # Just add all the final outputs that you want built.
    rule all:
        input:
            ARCHIVE_FILE,
    
    
    # hmmsearch
    rule hmm:
        input:
            species=INPUT_FILE,
            hmm=HMM_FILE,
        output:
            touch(OUTPUT_FILE),
        params:
            #        cmd='hmmsearch --noali -E 99 --tblout'
            cmd="echo",
        shell:
            "{params.cmd} {output} {input.hmm} {input.species} "
    
    
    rule concatenate:
        input:
            expand(OUTPUT_FILE, species=INP, hmm="{hmm}"),
        output:
            CAT_FILE,
        params:
            cmd="cat",
        shell:
            "{params.cmd} {input} > {output} "
    
    
    # create an archive with all results
    rule create_archive:
        input:
            CAT,
        output:
            ARCHIVE_FILE,
        shell:
            "tar -czvf {output} {input}"