I have written a Snakemake pipeline for my project, part of which looks like this:
SAMPLES, = glob_wildcards("/absolute/path/to/samples/{sample}.bam")

rule all:
    input:
        expand("splits/sample_check/{sample}_done.txt", sample=SAMPLES)

rule GVCFSplit:
    input:
        "gvcf/{SAMPLES}/",
        #"chr_pos_test/chr{c}/chr{c}_reg{i}.txt"
    output:
        "splits/sample_check/{SAMPLES}_done.txt"
    log:
        "logs/GVCFSplit/{SAMPLES}_done.log"
    benchmark:
        "benchmarks/GVCFSplit/{SAMPLES}_done.benchmark.txt"
    envmodules:
        "bcftools"
    resources:
        mem='1g',
        time='4:00:00',
        threads=1
    shell:
        r"""
        python3 /absolute/path/to/python/script/GVCF_split.py {wildcards.SAMPLES}
        """
The rule splits the per-chromosome files into 50 Mb chunks with the help of the Python script below:
from pathlib import Path
import subprocess
import sys

sample_id = sys.argv[1].strip()
chrs = list(range(1, 23))
exit_code = 0  # sum of the bcftools return codes (avoids shadowing the built-in exit)

# NOTE: chr_reg (indexed by chromosome, giving the number of regions) is not defined in this excerpt
for c in chrs:
    sample_file = "/path/to/chromosome/files/per/sample/%s/%s_chr%i.g.vcf.gz" % (sample_id, sample_id, c)
    for r in range(1, chr_reg[c] + 1):
        reg_file = "/path/to/per/chromosome/regions/chr%i/chr%i_reg%i.txt" % (c, c, r)
        #out_file=("try/gvcf/splits/chr%i/%s_reg%i.g.vcf.gz") % (c,sample_id,r)
        out_file = "/path/to/spiltted/vcf/files/chr%i/%s_reg%i.g.vcf.gz" % (c, sample_id, r)
        #Path(out_file).touch()
        proc = subprocess.run(["bcftools", "view", sample_file, "-Oz", "-o", out_file, "-R", reg_file])
        exit_code += proc.returncode

if exit_code == 0:
    # create a marker file so that Snakemake can track that everything went fine
    Path("splits/sample_check/" + sample_id + "_done.txt").touch()
    sys.exit(0)
else:
    sys.exit(1)
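Summing the return codes only tells me that something failed, not which bcftools call failed or why. A minimal sketch of how each call could report its own failure is shown below. The helper names run_step and run_all are hypothetical and not part of my script. It assumes Python 3.7+ for capture_output, and it counts failures instead of summing return codes, since return codes can be negative on POSIX when a process is killed by a signal.

```python
import subprocess
import sys


def run_step(cmd):
    """Run one command; on failure, print the command and its stderr to stderr."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        print(f"FAILED ({proc.returncode}): {' '.join(cmd)}", file=sys.stderr)
        print(proc.stderr, file=sys.stderr)
    return proc.returncode


def run_all(commands):
    """Run every command and return the number of commands that failed."""
    return sum(1 for cmd in commands if run_step(cmd) != 0)
```

In the script above, each ["bcftools", "view", ...] list would be passed to run_step, and the marker file would only be touched when no call failed.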
When I manually run the python script as:
python3 GVCF_split.py "sample_id"
it runs smoothly. However, when I submit this Snakemake file to the cluster with --profile, the jobs are submitted per sample as expected, but they fail immediately after they start running. Snakemake itself keeps running after that and no error is thrown. Here is the config file I use with the --profile flag:
cluster: mkdir -p slurm_snake/`basename {workflow.main_snakefile}`/{rule} &&
  sbatch
    --partition={resources.partition}
    --cpus-per-task={resources.threads}
    --mem={resources.mem}
    --time={resources.time}
    --job-name=smk-{rule}-{wildcards}
    --output=try/slurm_snake/`basename {workflow.main_snakefile}`/{rule}/{rule}-{wildcards}-%j.out
default-resources:
  - partition=main
  - mem='4G'
  - time="24:0:0"
  - threads=1
restart-times: 0
max-jobs-per-second: 5
max-status-checks-per-second: 1
local-cores: 1
latency-wait: 60
jobs: 1000
keep-going: True
rerun-incomplete: True
printshellcmds: True
scheduler: greedy
I have a similar setup for my original Snakemake file (this one is a copy for trying out a few changes before I apply them to the original). With the original file, the individual SLURM output files for each job submission are kept in the slurm_snake folder. For these rules, however, no SLURM output files are written. What could be the reason, and what am I doing wrong when submitting these jobs to the cluster?
Here is an example of the slurm output of the main snakemake cluster submission:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 1000
Job stats:
job          count    min threads    max threads
---------  -------  -------------  -------------
GVCFSplit        5              1              1
all              1              1              1
total            6              1              1
Select jobs to execute...
[Thu Mar 14 14:47:00 2024]
rule GVCFSplit:
    input: gvcf/12_19264_20
    output: splits/sample_check/12_19264_20_done.txt
    log: logs/GVCFSplit/12_19264_20_done.log
    jobid: 5
    benchmark: benchmarks/GVCFSplit/12_19264_20_done.benchmark.txt
    reason: Missing output files: splits/sample_check/12_19264_20_done.txt
    wildcards: SAMPLES=12_19264_20
    resources: mem_mb=1000, disk_mb=1000, tmpdir=/tmp, partition=main, mem=1g, time=4:00:00, threads=1
python3 /path/to/script/GVCF_split.py 12_19264_20
The python script runs without any error when I run it from the terminal manually.
This turned out to be caused by a mistake in my config file. I will leave the question up as a reminder and as a possible solution for others who run into the same issue. The config file had a section like the one below for cluster submissions:
cluster: mkdir -p slurm_snake/`basename {workflow.main_snakefile}`/{rule} &&
  sbatch
    --partition={resources.partition}
    --cpus-per-task={resources.threads}
    --mem={resources.mem}
    --time={resources.time}
    --job-name=smk-{rule}-{wildcards}
    --output=slurm_snake/`basename {workflow.main_snakefile}`/{rule}/{rule}-{wildcards}-%j.out
This part tells the cluster which resources to use, where to write the SLURM output, and so on. My mistake was having the wrong directory in the --output flag (something like main_dir/slurm-snake/...), so it did not match the directory created by mkdir -p. The cluster could not write the output to a directory that does not exist, so the submitted jobs failed immediately. After I fixed this, my pipeline ran smoothly.
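To catch this kind of mismatch before submitting, the cluster command can be checked as plain text. The sketch below is only an illustration. The function name check_cluster_dirs and the two example strings are hypothetical, and it assumes the command has the shape used above: a single `mkdir -p <dir> &&` followed by `sbatch ... --output=<path>` as the last option. It compares the path templates textually and does not resolve them on disk.

```python
import re
from pathlib import PurePosixPath


def check_cluster_dirs(cluster_cmd):
    """Return True if the directory made by `mkdir -p` is the directory
    that `sbatch --output=` writes into (textual comparison only)."""
    mkdir_match = re.search(r"mkdir\s+-p\s+(.+?)\s*&&", cluster_cmd)
    # --output is assumed to be the last option, or to be followed by another "--" option
    output_match = re.search(r"--output=(.+?)(?:\s+--|\s*$)", cluster_cmd)
    if mkdir_match is None or output_match is None:
        raise ValueError("expected 'mkdir -p <dir> &&' and '--output=<path>'")
    created_dir = PurePosixPath(mkdir_match.group(1))
    output_dir = PurePosixPath(output_match.group(1)).parent
    return created_dir == output_dir


# Example strings modelled on the profile above, folded onto one line
GOOD_CMD = (
    "mkdir -p slurm_snake/`basename {workflow.main_snakefile}`/{rule} && "
    "sbatch --partition={resources.partition} "
    "--output=slurm_snake/`basename {workflow.main_snakefile}`/{rule}/{rule}-{wildcards}-%j.out"
)
BAD_CMD = GOOD_CMD.replace("--output=slurm_snake", "--output=try/slurm_snake")

print(check_cluster_dirs(GOOD_CMD))
print(check_cluster_dirs(BAD_CMD))
```

The second string reproduces the mismatch from the question, where mkdir -p creates slurm_snake/... but --output points into try/slurm_snake/....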