
Can Snakemake parallelize the same rule both within and across nodes?


I have a somewhat basic question about Snakemake parallelization when using cluster execution: can jobs from the same rule be parallelized both within a node and across multiple nodes at the same time?

Let's say for example that I have 100 bwa mem jobs and my cluster has nodes with 40 cores each. Could I run 4 bwa mem per node, each using 10 threads, and then have Snakemake submit 25 separate jobs? Essentially, I want to parallelize both within and across nodes for the same rule.

Here is my current snakefile:

SAMPLES, = glob_wildcards("fastqs/{id}.1.fq.gz")
print(SAMPLES)

rule all:
    input:
        expand("results/{sample}.bam", sample=SAMPLES)

rule bwa:
    resources:
        time="4:00:00",
        partition="short-40core"
    input:
        ref="/path/to/reference/genome.fa",
        fwd="fastqs/{sample}.1.fq.gz",
        rev="fastqs/{sample}.2.fq.gz"
    output:
        bam="results/{sample}.bam"
    log:
        "results/logs/bwa/{sample}.log"
    params:
        threads=10
    shell:
        "bwa mem -t {params.threads} {input.ref} {input.fwd} {input.rev} 2> {log} | samtools view -bS - > {output.bam}"

I've run this with the following command:

snakemake --cluster "sbatch --partition={resources.partition}" -s bwa_slurm_snakefile --jobs 25

With this setup, I get 25 jobs submitted, each to a different node. However, only one bwa mem process (using 10 threads) is run per node.

Is there some straightforward way to modify this so that I could get 4 different bwa mem jobs (each using 10 threads) to run on each node?

Thanks!

Dave

Edit 07/28/22:

In addition to Troy's suggestion below, I found a straightforward way of accomplishing what I was trying to do by simply following the job grouping documentation.

Specifically, I did the following when executing my Snakemake pipeline:

snakemake --cluster "sbatch --partition={resources.partition}" -s bwa_slurm_snakefile --jobs 25 --groups bwa=group0 --group-components group0=4 --rerun-incomplete --cores 40

By specifying a group ("group0") for the bwa rule and setting "--group-components group0=4", I was able to group the jobs so that 4 bwa runs occur on each node.
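
As a side note, the same group assignment can also be declared in the Snakefile itself rather than on the command line; here is a minimal sketch of just the changed part of the rule (the --group-components group0=4 option still has to be passed at execution time):

rule bwa:
    group: "group0"  # same effect as passing --groups bwa=group0 on the command line
    resources:
        time="4:00:00",
        partition="short-40core"
    # ... input, output, log, params, and shell unchanged from the rule above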


Solution

  • You can try job grouping, but note that resources are typically summed together when group jobs are submitted like this. Usually that is not what you want, but in your case it happens to be correct.

    Instead you can make a group job with another rule that does the grouping for you in batches of 4.

    rule bwa_mem:
        group: 'bwa_batch'
        output: '{sample}.bam'
        ...
    
    def bwa_mem_batch_input(wildcards):
        # for wildcards.i, pick the 4 bwa_mem outputs that belong in this batch
        i = int(wildcards.i)
        return expand('{sample}.bam', sample=SAMPLES[i*4:i*4+4])
    
    rule bwa_mem_batch:
        input: bwa_mem_batch_input
        output: touch('flag_{i}')  # could be temp too
        group: 'bwa_batch'
    

    The consuming rule must then request flag_{i} for every batch index i, i.e. range(len(SAMPLES) // 4) when the sample count divides evenly by 4. With cluster integration, each SLURM job then gets one bwa_mem_batch job plus its 4 bwa_mem jobs, with the resources of a single bwa_mem job. This is useful for batching several short jobs together to increase the runtime of each submission.
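
    For illustration, the consuming rule could look something like the sketch below (the rule name all_batches is hypothetical; the ceiling division keeps a trailing partial batch of samples from being dropped):

    rule all_batches:
        input:
            # one flag per batch of 4 samples; ceiling division covers a final partial batch
            expand('flag_{i}', i=range((len(SAMPLES) + 3) // 4))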

    As a final point, this may do what you want, but I don't think it will help you get around QOS or other job quotas. You are using the same amount of CPU hours either way. You may also wait longer in the queue, because the scheduler has to find 40 free threads at once, whereas it could have started a few 10-thread jobs sooner. Instead, consider refining your resource requests for better efficiency, which may get your jobs started earlier.