I am trying to run a Snakefile that I have checked works for a small number of files, but when I run it with a larger number of input files it keeps failing with this error:
Building DAG of jobs...
Killed
As a clarification, I have 726 protein files and 19634 hmm files.
The Snakefile looks like this:
ARCHIVE_FILE = 'output.tar.gz'
# a single output file
OUTPUT_FILE = 'output_{hmm}/{species}_{hmm}.out'
# a single input file
INPUT_FILE = 'proteins/{species}.fasta'
# a single hmm file
HMM_FILE = 'hmm/{hmm}.hmm'
# a single cat file
CAT_FILE = 'cat/cat_{hmm}.txt'
# a single lines file
LINE_FILE = 'lines/lines_{hmm}.txt'
# a single bit file
BIT_FILE = 'bit_scores/bit_{hmm}.txt'
# Build the list of input files.
INP = glob_wildcards(INPUT_FILE).species
# Build the list of hmm files.
HMM = glob_wildcards(HMM_FILE).hmm
# The list of all output files
OUT = expand(OUTPUT_FILE, species=INP, hmm=HMM)
# The list of all CAT files
CAT = expand(CAT_FILE, hmm=HMM)
# The list of all lines files
LINE = expand(LINE_FILE, hmm=HMM)
# The list of all bit score files
BIT = expand(BIT_FILE, hmm=HMM)
# pseudo-rule that tries to build everything.
# Just add all the final outputs that you want built.
rule all:
    input: ARCHIVE_FILE

# hmmsearch
rule hmm:
    input:
        species=INPUT_FILE,
        hmm=HMM_FILE
    output:
        OUTPUT_FILE,
    params:
        cmd='hmmsearch --noali -E 99 --tblout'
    shell:
        '{params.cmd} {output} {input.hmm} {input.species}'

# concatenate output per hmm
rule concatenate:
    input:
        expand(OUTPUT_FILE, species=INP, hmm="{hmm}"),
    output:
        CAT_FILE,
    params:
        cmd="cat",
    shell:
        "{params.cmd} {input} > {output}"

# clean cat files
rule clean_cats:
    input:
        cmd='/home/agalvez/bin/remove_lines_starting_with_#.pl',
        values=CAT_FILE
    output: LINE_FILE
    shell:
        '{input.cmd} -input {input.values} -output {output}'

# create an archive with all results
rule create_archive:
    input: OUT, CAT, LINE,
    output: ARCHIVE_FILE
    shell: 'tar -czvf {output} {input}'
Does anyone know how to solve this problem?
I think you can concatenate all the protein sequences into a single fasta file and run that against the hmm profiles. That way you have 19634 jobs instead of 19634 x 726. Alternatively, you could combine the hmm profiles into a single file and run one single hmmsearch job.
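A minimal sketch of the first approach, assuming the same directory layout as your Snakefile (proteins/{species}.fasta and hmm/{hmm}.hmm); the file name all_proteins.fasta, the output naming, and the rule names here are placeholders of mine, so adapt them to your setup:

SPECIES = glob_wildcards('proteins/{species}.fasta').species
HMM = glob_wildcards('hmm/{hmm}.hmm').hmm

rule all:
    input: expand('output_{hmm}.out', hmm=HMM)

# Collapse the 726 protein files into one fasta, so each hmm profile
# is searched once against everything.
rule combine_proteins:
    input: expand('proteins/{species}.fasta', species=SPECIES)
    output: 'all_proteins.fasta'
    shell: 'cat {input} > {output}'

# One hmmsearch job per profile: 19634 jobs instead of 19634 x 726.
rule hmm:
    input:
        fasta='all_proteins.fasta',
        hmm='hmm/{hmm}.hmm'
    output: 'output_{hmm}.out'
    shell: 'hmmsearch --noali -E 99 --tblout {output} {input.hmm} {input.fasta}'

If you prefer the second option, concatenating the profiles (cat hmm/*.hmm > all.hmm) gives you a profile database that hmmsearch accepts as its first argument, so a single job covers everything; the --tblout table then has one row per hit with the profile name in it, which you can split downstream if needed.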
Besides, even if you succeed in running snakemake the way you plan, working with ~14 million output files is going to be terrible. What you are trying to do, running many proteins against many profiles, is not unusual, but I feel you are making it more complicated than necessary.