I need to concatenate certain files inside some directories that have been created using a wildcard in a Snakefile
. I tried to create the following rule to concatenate all files inside these directories:
# concatenate output per hmm
rule concatenate:
input:
output_{hmm}/* ,
output:
output_{hmm}/cat_{hmm}.txt,
params:
cmd='cat'
shell:
'{params.cmd} {input} > {output} '
It doesn't work and produced the following error:
"SyntaxError in line 62 of /scratch/data1/agalvez/domains/Snakefile_ecdf:
invalid syntax (Snakefile_ecdf, line 62)"
I do not know what is wrong with the rule and I suppose that the use of *
might not be adequate, but I cannot think of another way to do what I intend to.
EDIT: The question might be lacking some information so I will also attach the complete Snakefile:
ARCHIVE_FILE = 'output.tar.gz'
# a single output file
OUTPUT_FILE = 'output_{hmm}/{species}_{hmm}.out'
# a single input file
INPUT_FILE = 'proteins/{species}.fasta'
# a single hmm file
HMM_FILE = 'hmm/{hmm}.hmm'
# a single cat file
CAT_FILE = 'cat/cat_{hmm}.txt'
# Build the list of input files.
INP = glob_wildcards(INPUT_FILE).species
# Build the list of hmm files.
HMM = glob_wildcards(HMM_FILE).hmm
# The list of all output files
OUT = expand(OUTPUT_FILE, species=INP, hmm=HMM)
# The list of all CAT files
CAT = expand(CAT_FILE, hmm=HMM)
# pseudo-rule that tries to build everything.
# Just add all the final outputs that you want built.
rule all:
input: ARCHIVE_FILE
# hmmsearch
rule hmm:
input:
species=INPUT_FILE ,
hmm=HMM_FILE
output:
OUTPUT_FILE,
params:
cmd='hmmsearch --noali -E 99 --tblout'
shell:
'{params.cmd} {output} {input.hmm} {input.species} '
# concatenate output per hmm
from glob import glob
rule concatenate:
input:
files = glob("output_{hmm}/*") ,
output:
CAT_FILE,
params:
cmd='cat'
shell:
'{params.cmd} {input.files} {output} '
# create an archive with all results
rule create_archive:
input: OUT, CAT,
output: ARCHIVE_FILE
shell: 'tar -czvf {output} {input}'
For testing purposes, let's create a sample file set (so operating with dummy files to make sure the workflow works). In the terminal, I ran:
mkdir proteins && touch proteins/1.fasta proteins/2.fasta
mkdir hmm && touch hmm/A.hmm hmm/B.hmm
Now, your workflow is mostly correct, except for the rule concatenate
. The inputs for this rule are created by rule hmm
, and the output of this rule is specific to wildcard value of hmm
. So you are interested in all values for species
for a given hmm
. The way to get that is to use expand
but keep hmm
in wildcard format, using expand(OUTPUT_FILE, species=INP, hmm="{hmm}")
:
rule concatenate:
input:
expand(OUTPUT_FILE, species=INP, hmm="{hmm}"),
output:
CAT_FILE,
params:
cmd="cat",
shell:
"{params.cmd} {input} > {output} "
In the workflow below I modified rule hmm
for quick test run, so the complete workflow will look like this:
ARCHIVE_FILE = "output.tar.gz"
# a single output file
OUTPUT_FILE = "output_{hmm}/{species}_{hmm}.out"
# a single input file
INPUT_FILE = "proteins/{species}.fasta"
# a single hmm file
HMM_FILE = "hmm/{hmm}.hmm"
# a single cat file
CAT_FILE = "cat/cat_{hmm}.txt"
# Build the list of input files.
INP = glob_wildcards(INPUT_FILE).species
# Build the list of hmm files.
HMM = glob_wildcards(HMM_FILE).hmm
# The list of all output files
OUT = expand(OUTPUT_FILE, species=INP, hmm=HMM)
# The list of all CAT files
CAT = expand(CAT_FILE, hmm=HMM)
# pseudo-rule that tries to build everything.
# Just add all the final outputs that you want built.
rule all:
input:
ARCHIVE_FILE,
# hmmsearch
rule hmm:
input:
species=INPUT_FILE,
hmm=HMM_FILE,
output:
touch(OUTPUT_FILE),
params:
# cmd='hmmsearch --noali -E 99 --tblout'
cmd="echo",
shell:
"{params.cmd} {output} {input.hmm} {input.species} "
rule concatenate:
input:
expand(OUTPUT_FILE, species=INP, hmm="{hmm}"),
output:
CAT_FILE,
params:
cmd="cat",
shell:
"{params.cmd} {input} > {output} "
# create an archive with all results
rule create_archive:
input:
CAT,
output:
ARCHIVE_FILE,
shell:
"tar -czvf {output} {input}"