Tags: python, out-of-memory, hpc, snakemake

Snakemake process Killed


I am trying to run a Snakefile that I have verified works for a small number of files, but when I run it with a larger number of input files it keeps failing with this error:

Building DAG of jobs...
Killed

For context, I have 726 protein files and 19,634 HMM files.

The Snakefile looks like this:

ARCHIVE_FILE = 'output.tar.gz'

# a single output file
OUTPUT_FILE = 'output_{hmm}/{species}_{hmm}.out'

# a single input file
INPUT_FILE = 'proteins/{species}.fasta'

# a single hmm file
HMM_FILE = 'hmm/{hmm}.hmm'

# a single cat file
CAT_FILE = 'cat/cat_{hmm}.txt'

# a single lines file
LINE_FILE = 'lines/lines_{hmm}.txt'

# a single bit file
BIT_FILE = 'bit_scores/bit_{hmm}.txt'

# Build the list of input files.
INP = glob_wildcards(INPUT_FILE).species

# Build the list of hmm files.
HMM = glob_wildcards(HMM_FILE).hmm

# The list of all output files
OUT = expand(OUTPUT_FILE, species=INP, hmm=HMM)

# The list of all CAT files
CAT = expand(CAT_FILE, hmm=HMM)

# The list of all lines files
LINE = expand(LINE_FILE, hmm=HMM)

# The list of all bit files
BIT = expand(BIT_FILE, hmm=HMM)

# pseudo-rule that tries to build everything.
# Just add all the final outputs that you want built.
rule all:
    input: ARCHIVE_FILE

# hmmsearch
rule hmm:
    input:
        species=INPUT_FILE,
        hmm=HMM_FILE
    output: 
        OUTPUT_FILE,
    params:
        cmd='hmmsearch --noali -E 99 --tblout'
    shell: 
        '{params.cmd} {output} {input.hmm} {input.species} '

# concatenate output per hmm
rule concatenate:
    input:
        expand(OUTPUT_FILE, species=INP, hmm="{hmm}"),
    output:
        CAT_FILE,
    params:
        cmd="cat",
    shell:
        "{params.cmd} {input} > {output} "

# clean cat files
rule clean_cats:
    input:
        cmd='/home/agalvez/bin/remove_lines_starting_with_#.pl',
        values=CAT_FILE
    output: LINE_FILE
    shell: 
        '{input.cmd} -input {input.values} -output {output}'

# create an archive with all results
rule create_archive:
    input: OUT, CAT, LINE,
    output: ARCHIVE_FILE
    shell: 'tar -czvf {output} {input}'

Does anyone know how to solve this problem?


Solution

  • I think you can concatenate all the protein sequences into a single FASTA file and run that against the HMM profiles. That way you have 19,634 jobs instead of 19,634 × 726. You could even go further and also combine the HMM profiles into a single file, leaving a single hmmsearch job. (A sketch of the first option is below.)

    Besides, even if you do get Snakemake to run the way you plan, working with ~14 million files is going to be painful. What you are trying to do, searching many proteins against many profiles, is not unusual, but you are making it more complicated than it needs to be.
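
    Here is a minimal sketch of the first option. It is untested against your data; the concatenated file name (proteins_all.fasta) and the output/ layout are my assumptions, and the hmmsearch options are copied from your rule:

INP = glob_wildcards('proteins/{species}.fasta').species
HMM = glob_wildcards('hmm/{hmm}.hmm').hmm

rule all:
    input: expand('output/{hmm}.out', hmm=HMM)

# Concatenate the 726 protein fastas once, up front.
rule concat_proteins:
    input: expand('proteins/{species}.fasta', species=INP)
    output: 'proteins_all.fasta'
    shell: 'cat {input} > {output}'

# One job per profile: 19634 jobs instead of 19634 x 726.
rule hmm:
    input:
        fasta='proteins_all.fasta',
        hmm='hmm/{hmm}.hmm'
    output: 'output/{hmm}.out'
    shell: 'hmmsearch --noali -E 99 --tblout {output} {input.hmm} {input.fasta}'

    For the second option, hmmsearch accepts an HMM file containing multiple profiles, so something like cat hmm/*.hmm > all.hmm followed by hmmsearch --noali -E 99 --tblout all.out all.hmm proteins_all.fasta would reduce everything to a single search.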