Tags: mapping, workflow, nextflow

bwa fail to load index using nextflow


I am writing a BWA mapping module using Nextflow (DSL 2), modules/map_reads.nf, to map single-end reads. When I execute this workflow it does not return an error in the terminal, and it outputs BAM files with the correct file names. However, I found that the BAM files are not correctly mapped, and .command.err contains the error:

[E::bwa_idx_load_from_disk] fail to locate the index files

I have checked that the paths are correct and have also executed the shell command directly in the terminal. I would appreciate any suggestions or solutions to this problem.

modules/map_reads.nf

#!/usr/bin/env nextflow

nextflow.enable.dsl=2

process mapping {
    
    conda 'envs/bwa.yml'
    publishDir 'results/mapped', mode: 'copy'
  
    input:
        tuple val(sample_id), file(fastq)
        file index

    output:
        tuple val(sample_id), file('*.bam')

    script:
        """
        bwa mem $index $fastq | samtools view -b - > ${sample_id}.bam
        """

}

workflow {
  fastq_data = channel.fromPath( 'data/samples/*.fastq' ).map { file -> tuple(file.baseName, file) }
  index = channel.fromPath( 'data/genome.fa' )
  mapping( fastq_data, index )
}

Here is my directory structure:

Directory Structure

envs/bwa.yml

name: bwa
channels:
  - bioconda
  - defaults
dependencies:
  - bwa
  - samtools=1.9

Solution

  • [E::bwa_idx_load_from_disk] fail to locate the index files

    BWA MEM expects a number of index files to be found alongside its first argument, but you've only localized the genome FASTA file:

    index = channel.fromPath( 'data/genome.fa' )
    

    BWA MEM only requires the actual index files; it doesn't require the FASTA file (or the FASTA index), so you can save some time and resources by skipping the localization of the FASTA. This is especially relevant if you localize from S3, for example, since the FASTA file is often a couple of gigabytes. Also ensure you use a value channel here, so that the channel can be consumed an unlimited number of times:

    process mapping {
        
        conda 'envs/bwa.yml'
        publishDir 'results/mapped', mode: 'copy'
      
        input:
        tuple val(sample_id), path(fastq)
        path bwa_index
    
        output:
        tuple val(sample_id), path("${sample_id}.bam")
    
        script:
        def idxbase = bwa_index[0].baseName
    
        """
        bwa mem "${idxbase}" "${fastq}" | samtools view -b - > "${sample_id}.bam"
        """
    
    }
    
    workflow {
    
        fastq_data = channel.fromPath( 'data/samples/*.fastq' ).map { file ->
            tuple(file.baseName, file)
        }
    
        bwa_index = file( 'data/genome.fa.{amb,ann,bwt,pac,sa}' )
    
        mapping( fastq_data, bwa_index )
    }
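
    As a quick sanity check of the glob above, bash brace expansion shows the five filenames the pattern targets; these are exactly the files that `bwa index data/genome.fa` creates (assuming the default index naming):

    ```shell
    # Brace expansion lists the index files the Nextflow glob should pick up.
    # bwa mem is later invoked with the shared prefix (data/genome.fa), which
    # is what bwa_index[0].baseName recovers in the process above.
    echo data/genome.fa.{amb,ann,bwt,pac,sa}
    # → data/genome.fa.amb data/genome.fa.ann data/genome.fa.bwt data/genome.fa.pac data/genome.fa.sa
    ```

    If any of these five files is missing from the work directory, BWA MEM fails with exactly the `fail to locate the index files` error you saw.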
    

    Also, the reason you didn't see an error on the command line is that, by default, the exit status of a pipeline is the exit status of its last command. In your BWA MEM / SAMtools pipeline, the last command (i.e. samtools) completes successfully and returns exit status zero. The option you want to add is called pipefail; from man bash:

    pipefail

    If set, the return value of a pipeline is the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands in the pipeline exit successfully. This option is disabled by default.
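
    You can see this behaviour with a minimal pipeline, using `false | true` as a stand-in for a failing `bwa mem` piped into a succeeding `samtools view`:

    ```shell
    # Default: the pipeline's exit status is that of the LAST command,
    # so the failure of `false` (like a failing bwa mem) is masked.
    bash -c 'false | true'; echo "without pipefail: $?"
    # → without pipefail: 0

    # With pipefail, any failing command makes the whole pipeline fail,
    # so Nextflow sees the non-zero status and marks the task as failed.
    bash -c 'set -o pipefail; false | true'; echo "with pipefail: $?"
    # → with pipefail: 1
    ```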

    Usually, you can just add the following to your nextflow.config to have it applied to all processes:

    process {
      shell = [ '/bin/bash', '-euo', 'pipefail' ]
    }