Search code examples
groovydslnextflow

Nextflow - I/O and index issues


I am new to nextflow and am trying to build a pipeline. I am having trouble with passing outputs into new processes, properly capturing outputs, and with BWA-MEM2 index designation.

Here is my script:

#!/usr/bin/env nextflow
// Declare syntax version
nextflow.enable.dsl = 2

// Set parameters
params.fastq = "$projectDir/fastq/*_{1,2}.f*"
params.refgenome = "$projectDir/RefGenome/hg38.fa"

workflow {

    read_pairs_ch = Channel.fromFilePairs(params.fastq, checkIfExists: true) 

    trimAndQC(read_pairs_ch)
    readMapping(params.refgenome, trimAndQC.out.trimmedreads)
}

process trimAndQC {
    debug true
    tag "$sample_id"

    publishDir 'trimmed/', mode: 'copy', overwrite: false

    input:
    tuple val(sample_id), path(fastq)

    output:
    tuple val('sample_id'), path('*_val_1.fq.gz'), path('*_val_2.fq.gz'), emit: trimmedreads

    shell:
    """
    trim_galore --paired --fastqc ${fastq.get(0)} ${fastq.get(1)}
    """
}

// BWA read mapping
process readMapping {
    debug true
    tag "$sample_id"

    publishDir 'bwa_aligned/', mode: 'copy', overwrite: false

    input:
    path(refgenome)
    tuple val(sample_id), val(trimmedreads)

    output:
    path('${sample_id}_sorted.bam'), emit: sorted_bam

    script:
    def indexbase = refgenome.baseName
    
    """
    bwa-mem2 mem -t 2 '${indexbase}' '${trimmedreads}' | samtools sort -@2 -o '${sample_id}_sorted.bam' -
    """
}

First, I am getting this warning:

WARN: Input tuple does not match input set cardinality declared by process `readMapping` -- offending value: [sample_id, /Users/me/Desktop/fastq/work/27/448a89259a9435ff86d52009afa05d/SRR925811_1_val_1.fq.gz, /Users/me/Desktop/fastq/work/27/448a89259a9435ff86d52009afa05d/SRR925811_2_val_2.fq.gz]

Which is confusing as my output is a tuple and thus should match the cardinality of the output of trimAndQC. As an aside, I have noticed most scripts passing the input tuple (in this case path(fastq)) as a single argument, but to get the command to work I have to use the .get() accession method to 'unpack' the tuple, so I am unsure what I am doing wrong there.

Second, I am getting this error in regards to the output of readMapping process:

ERROR ~ Error executing process > 'readMapping (sample_id)'

Caused by:
  Missing output file(s) `${sample_id}_sorted.bam` expected by process `readMapping (sample_id)`

Command executed:

  bwa-mem2 mem -t 2 'hg38' '/Users/me/Desktop/fastq/work/27/448a89259a9435ff86d52009afa05d/SRR925811_1_val_1.fq.gz' | samtools sort -@2 -o sample_id_sorted.bam -

Command exit status:
  0

Command output:
  (empty)

I assume this may be due to the cardinality warning above and sample_id is empty.

Finally, I cannot seem to designate the index properly to the BWA-MEM2 command. The error:

Command error:
  Looking to launch executable "/Users/me/opt/anaconda3/envs/nextflow/bin/bwa-mem2.sse42", simd = .sse42
  Launching executable "/Users/me/opt/anaconda3/envs/nextflow/bin/bwa-mem2.sse42"
  -----------------------------
  Executing in SSE4.2 mode!!
  -----------------------------
  * SA compression enabled with xfactor: 8
  * Ref file: hg38
  * Entering FMI_search
  ERROR! Unable to open the file: hg38.bwt.2bit.64

Work dir:
  /Users/me/Desktop/fastq/work/23/f8b3b6eace6f11f8619ce435b16725

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details

In my RefGenome directory I have hg38.fa indexed on an AWS EC2 and downloaded to locally to prototype the pipeline. The files look have these extensions: hg38.fa.{0123,amb,ann,pac,bwt.2bit.64} and appear to work fine if I just run the bwa-mem2 mem command in my terminal.

I have tried to adapt multiple SO answers (e.g., how to chain Channel.fromFilePairs from one process to another process, nextflow: how to pass directory path with files created in process before), I have gone through the EMBL training (Nextflow), looked through pipelines on nf-core and github, and have spent time with nextflow documentation prior to asking this question. I seem to be misunderstanding some inherent principle of nextflow so any advice on what to focus would be appreciated.

Finally, for clarity and reproducibility I am using the following fastqs from NCBI SRA (SRX317818) for prototyping. These are the versions of nextflow, trim-galore, and bwa-mem2 I am using all via an Anconda environment:

bwa-mem2                  2.2.1                hdf58011_5    bioconda
nextflow                  23.10.1              hdfd78af_0    bioconda
trim-galore               0.6.10               hdfd78af_0    bioconda

I have crossposted this question on Biostars (Nextflow - I/O and index issues). Based on this post, it appears crossposting is tolerated to a degree (Is cross-posting a question...). I can remove the post if necessary.


Solution

  • Thanks to @dthorbur at Biostars for his help with this problem.


    I have finally completely resolved all my problems and want to document here in case anyone else runs into such issues in the future.

    First problem was issues with cardinality. As @dthorbur pointed out, my issue was not keeping the same tuple structure between processes. The output from the upstream process:

    output:
    tuple val('sample_id'), path('*_val_1.fq.gz'), path('*_val_2.fq.gz')
    

    Whereas the input into the downstream process:

    input:
    tuple val(sample_id), val(trimmedreads)
    

    Resulting in an easy fix:

    input:
    tuple val(sample_id), path(read1), path(read2)
    

    Next issue I was having was ingesting the index for BWA into my working environment. I found the most straightforward and simple method was to designate the index directory and fasta file separately like this:

    // Set paths for index directory and the reference fasta
    params.index_dir = "$projectDir/RefGenome/BWAIndex"
    params.ref = "hg38.fa"
    ------------------------
    ...
    // Designate the index within the processes like this
    script:
    """
    bwa mem -t 2 ${index_dir}/${ref} ${read1} ${read2} | samtools sort -@2 -o "${sample_id}_sorted.bam" -
    """
    ------------------------
    
    // Define the channels and pass them to the process
    index_ch = Channel.fromPath(params.index_dir)
    ref_ch = Channel.of(params.ref)
    ...
    bwa_align(index_ch, ref_ch, trim_QC.out.trimmed_reads)
    

    My final problem was getting the bwa_align process to run properly. My original error was related to no output file, which I realized was an issue with single quotes around the output file and within the script block:

    output:
    path('${sample_id}_sorted.bam')
    

    which Nextflow wasn't interpreting properly. Once I switched to double quotes all was fine. So I am tempted to strictly use double quotes when writing Nextflow scripts.

    Here is my final script that finally worked:

    #!/usr/bin/env nextflow
    nextflow.enable.dsl = 2
    
    // Define parameters
    params.fastq = "$projectDir/fastq/*_{1,2}*"
    params.index_dir = "$projectDir/RefGenome/BWAIndex"
    params.ref = "hg38.fa"
    
    workflow {
    
        fastq_ch = Channel.fromFilePairs(params.fastq)
        index_ch = Channel.fromPath(params.index_dir)
        ref_ch = Channel.of(params.ref)
    
        trim_QC(fastq_ch)
        bwa_align(index_ch, ref_ch, trim_QC.out.trimmed_reads)
    }
    
    process trim_QC {
        debug true
        tag "$sample_id"
    
        publishDir 'trimmed/', mode: 'copy', overwrite: false
    
        input:
        tuple val(sample_id), path(fastq)
    
        output:
        tuple val("${sample_id}"), path("*_val_1.fq.gz"), path("*_val_2.fq.gz"), emit: trimmed_reads
    
        script:
        def (read1, read2) = fastq
        """
        trim_galore --paired --fastqc ${read1} ${read2}
        """
    }
    
    process bwa_align {
        debug true
        tag "$sample_id"
    
        publishDir 'bwa_aligned/', mode: 'copy', overwrite: false
    
        input:
        path index_dir
        val ref
        tuple val(sample_id), path(read1), path(read2)
    
        output:
        path("${sample_id}_sorted.bam"), emit: sorted_bam
    
        script:
        """
        bwa mem -t 2 ${index_dir}/${ref} ${read1} ${read2} | samtools sort -@2 o "${sample_id}_sorted.bam" -
        """
    }