
Nextflow capture output file by partial pattern


I've got a Nextflow process that looks like:

process my_app {

    publishDir "${outdir}/my_app", mode: params.publish_dir_mode

    input:
        path input_bam
        path input_bai
        val output_bam
        val max_mem
        val threads
        val container_home
        val outdir

    output:
        tuple env(output_prefix), path("${output_bam}"), path("${output_bam}.bai"), emit: tuple_ch

    shell:
        '''
        my_script.sh \
            !{input_bam} \
            !{output_bam} \
            !{max_mem} \
            !{threads}

        output_prefix=$(echo !{output_bam} | sed "s#.bam##")
        '''
}

This process only declares two outputs (the .bam and .bai files), but my_script.sh also creates several .vcf files that are not being published to the output directory.

To capture the files created by the script, I tried the following output declaration, without success:

output:
    tuple env(output_prefix), path("${output_bam}"), path("${output_bam}.bai"), path("${output_prefix}.*.vcf"), emit: mt_validation_simulation_tuple_ch

but in the logs I can see:

Error executing process caused by:
  Missing output file(s) `null.*.vcf` expected by process `my_app_wf:my_app`

What am I missing? Could you help me? Thank you in advance!


Solution

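  • The problem is that output_prefix is only defined as a shell variable inside the shell block. env(output_prefix) can read that variable back once the task has finished, but "${output_prefix}" in the path declaration refers to a Groovy variable, which is never set, so the pattern evaluates to `null.*.vcf`. If all you need for your output prefix is the file's basename (without extension), you can use a regular script block and read it from the file's attributes (baseName is a file attribute, so this assumes output_bam is a file object rather than a plain string). Note that variables defined in the script block (but outside the command string) are global (within the process scope) unless they're defined using the def keyword: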

    process my_app {
    
        ...
    
        output:
        tuple val(output_prefix), path("${output_bam}{,.bai}"), path("${output_prefix}.*.vcf")
    
        script:
        output_prefix = output_bam.baseName
    
        """
        my_script.sh \\
            "${input_bam}" \\
            "${output_bam}" \\
            "${max_mem}" \\
            "${threads}"
        """
    }
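
    The scoping rule described above can be seen in isolation with a small sketch. The process name scope_demo, the touch command standing in for my_script.sh, and the sample file names are all hypothetical. The prefix is derived here with a plain string replacement, on the assumption that output_bam arrives as a string, as the question's `val output_bam` suggests:

    ```groovy
    process scope_demo {

        input:
        val output_bam

        output:
        tuple val(output_prefix), path("${output_prefix}.*.vcf")

        script:
        // No `def` here, so output_prefix is visible to the output block above.
        // Strips a trailing ".bam" (assumes output_bam is a plain string).
        output_prefix = output_bam.toString().replaceAll(/\.bam$/, '')

        """
        # Stand-in for my_script.sh: create files matching the output glob
        touch "${output_prefix}.snps.vcf" "${output_prefix}.indels.vcf"
        """
    }

    workflow {
        scope_demo(Channel.of('sample1.bam'))
        scope_demo.out.view()
    }
    ```

    Declaring the variable with `def output_prefix = ...` instead would make it local to the script block, and the output declaration would no longer see it.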
    

    If the process creates the BAM (and index) it might even be possible to refactor away the multiple input channels if an output prefix can be supplied up front. Usually this makes more sense, but I don't have enough details to say one way or the other. The following might suffice as an example; you may need/prefer to combine/change the output declaration(s) to suit, but hopefully you get the idea:

    params.publish_dir = './results'
    params.publish_mode = 'copy'
    
    process my_app {
    
        publishDir "${params.publish_dir}/my_app", mode: params.publish_mode
    
        cpus 1
        memory 1.GB
    
        input:
        tuple val(prefix), path(indexed_bam)
    
        output:
        tuple val(prefix), path("${prefix}.bam{,.bai}"), emit: bam_files
        tuple val(prefix), path("${prefix}.*.vcf"), emit: vcf_files
    
        """
        my_script.sh \\
            "${indexed_bam.first()}" \\
            "${prefix}.bam" \\
            "${task.memory.toGiga()}G" \\
            "${task.cpus}"
        """
    }
    

    Note that indexed_bam expects a tuple of the form tuple(bam, bai), which is why indexed_bam.first() refers to the BAM file.
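
    One way to build that input is sketched below. It assumes a hypothetical params.bam parameter pointing at BAM files whose indexes sit next to them as <bam>.bai, and it adds a hypothetical `.processed` suffix to the prefix so the output BAM does not share a name with the staged input. Neither detail comes from the original question, so adjust to suit:

    ```groovy
    // Hypothetical parameter: path (or glob) to the input BAM file(s)
    params.bam = './data/*.bam'

    workflow {
        // Emit tuple(prefix, [bam, bai]) for each BAM.
        // Assumes the index exists alongside the BAM as <bam>.bai.
        bam_ch = Channel
            .fromPath(params.bam, checkIfExists: true)
            .map { bam -> tuple("${bam.baseName}.processed", [bam, file("${bam}.bai")]) }

        my_app(bam_ch)

        // Inspect the VCF files captured for each prefix
        my_app.out.vcf_files.view()
    }
    ```

    Keeping the BAM first in the list matters, because the process picks it out with indexed_bam.first().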