parallel-processing, cluster-computing, hpc, lsf, nextflow

How to convert a loop to a job array on an LSF cluster


I have 100 files and I want to parallelise my submissions to save time, instead of running jobs one by one. How can I change this script into a job array in LSF, using the bsub submission system, so that 10 jobs run at a time?

#BSUB -J ExampleJob1         #Set the job name to "ExampleJob1"
#BSUB -L /bin/bash           #Uses the bash login shell to initialize the job's execution environment.
#BSUB -W 2:00                #Set the wall clock limit to 2hr
#BSUB -n 1                   #Request 1 core
#BSUB -R "span[ptile=1]"     #Request 1 core per node.
#BSUB -R "rusage[mem=5000]"  #Request 5000MB per process (CPU) for the job
#BSUB -M 5000                #Set the per process enforceable memory limit to 5000MB.
#BSUB -o Example1Out.%J      #Send stdout and stderr to "Example1Out.[jobID]"

path=./home/                 #Directory containing the BAM files
cd "${path}"

for each in *.bam
do
    samtools coverage "${each}" -o "${each}_coverage.txt"
done

Thank you for your time; any help is appreciated. I am a beginner with LSF and quite confused.


Solution

  • You tagged your question with nextflow, so I will provide a minimal (untested) solution using Nextflow with the LSF executor enabled. Nextflow abstracts away the underlying job submission system and lets us focus on writing the pipeline, however trivial. I think this approach is preferable, although it does add a dependency on Nextflow. It's a small one, and maybe it's overkill for your current requirements, but Nextflow brings other benefits, like being able to modify and resume the pipeline when those requirements inevitably change.

    Contents of main.nf:

    params.bam_files = './path/to/bam_files/*.bam'
    params.publish_dir = './results'
    
    
    process samtools_coverage {
    
        tag { bam.baseName }
    
        publishDir "${params.publish_dir}/samtools/coverage", mode: 'copy'
    
        cpus 1
        memory 5.GB
        time 2.h
    
        input:
        path bam
    
        output:
        path "${bam.baseName}_coverage.txt"
    
        """
        samtools coverage \\
            -o "${bam.baseName}_coverage.txt" \\
            "${bam}"
        """
    }
    
    workflow {
    
        bam_files = Channel.fromPath( params.bam_files )
    
        samtools_coverage( bam_files )
    }
    

    Contents of nextflow.config:

    process {
    
        executor = 'lsf'
    }
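
    To mirror your requirement of running only 10 jobs at a time, note that Nextflow's executor scope also has a queueSize option, which caps how many jobs Nextflow will submit and manage concurrently (the default is 100). A minimal sketch to add to nextflow.config:

    executor {
    
        queueSize = 10
    }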
    

    Run using:

    nextflow run main.nf
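
    If you later modify the pipeline or a run is interrupted, appending the -resume option picks up from the cached results instead of recomputing everything:

    nextflow run main.nf -resume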
    

    Note also:

    LSF supports both per-core and per-job memory limits. Nextflow assumes that LSF works in the per-core memory limits mode, so it divides the requested memory by the number of requested cpus.

    This is not required when LSF is configured to work in per-job memory limit mode. You will need to specify that by adding the option perJobMemLimit in the executor scope of the Nextflow configuration file.
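
    A minimal sketch of the executor scope with per-job memory limits enabled:

    executor {
    
        perJobMemLimit = true
    }

    For completeness, the loop can also be converted to a plain LSF job array without Nextflow. Below is an untested sketch that assumes exactly 100 BAM files, with whitespace-free names, in the submission directory: the [1-100]%10 suffix tells LSF to run at most 10 of the 100 array elements at a time, and each element uses LSB_JOBINDEX to pick its own file:

    #BSUB -J "ExampleJob1[1-100]%10"  #Job array of 100 elements, at most 10 running at once
    #BSUB -L /bin/bash
    #BSUB -W 2:00
    #BSUB -n 1
    #BSUB -R "span[ptile=1]"
    #BSUB -R "rusage[mem=5000]"
    #BSUB -M 5000
    #BSUB -o Example1Out.%J.%I        #%I expands to the array index
    
    # Select the N-th BAM file for this array element (N = LSB_JOBINDEX).
    bam=$(ls *.bam | sed -n "${LSB_JOBINDEX}p")
    
    samtools coverage "${bam}" -o "${bam}_coverage.txt"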