How do you pass output from one Nextflow Channel to another and run an .Rmd file?


I have a Nextflow pipeline that has two channels.

  • The first channel runs and outputs 6 .tsv files to a folder called 'results'.
  • The second channel is supposed to use all of these 6 .tsv files and create a .pdf report using knitr in R in a process called 'createReport'.

My workflow code looks like this:

workflow {
  inputFileChannel = Channel.fromPath(params.pathOfInputFile, type: 'file') // | collect | createReport // creating channel to pass in input file
  findNumOfProteins(inputFileChannel)  // passing in the channel to the process
  findAminoAcidFrequency(inputFileChannel)
  getProteinDescriptions(inputFileChannel)
  getNumberOfLines(inputFileChannel)
  getNumberOfLinesWithoutSpaces(inputFileChannel)
  getLengthFreq(inputFileChannel)

  outputFileChannel = Channel.fromPath("$params.outdir.main/*.tsv", type: 'file').buffer(size:6)
  createReport(outputFileChannel)
}

My 'createReport' process currently looks like this:

process createReport {
  module 'R/4.2.2'

  publishDir params.outdir.output, mode: 'copy'


  output:
    path 'report.pdf'

  script:
      """
          R -e "rmarkdown::render('./createReport.Rmd')"
      """
}

And my 'createReport.Rmd' looks like this (tested in RStudio, where it gives the correct .pdf output):

---
title: "R Markdown Practice"
author: "-"
date: "2022-12-08"
output: pdf_document
---

{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

library(readr)
dataSet <- list.files(path="/Users/-/Desktop/code/nextflow_practice/results/", pattern="*.tsv")
print(dataSet)

for (data in dataSet) {
  print(paste("Showing the table for:", data))
  targetData <- read.table(file=paste("/Users/-/Desktop/code/nextflow_practice/results/", data, sep=""),
             head=TRUE,
             nrows=5,
             sep="\t") 
  print(targetData)
  
  if (data == "length_data.tsv") {
    data_to_graph <- read_tsv(paste("/Users/-/Desktop/code/nextflow_practice/results/", data, sep=""), show_col_types = FALSE)
    plot(x = data_to_graph$LENGTH,y = data_to_graph$FREQ, xlab = "x-axis", ylab = "y-axis", main = "P")
  }

  writeLines("-----------------------------------------------------------------")
}

What would be the correct way to write the createReport process and the workflow sections so as to be able to pass the 6 .tsv outputs from the first channel into the second channel to create the report?

Sorry, I am very new to Nextflow and the documentation doesn't help me as much as I would like it to!


Solution

  • Your outputFileChannel is trying to read files from the publishDir (i.e. 'results'). The problem with accessing files in this directory is described in the Nextflow documentation for publishDir:

    Files are copied into the specified directory in an asynchronous manner, thus they may not be immediately available in the published directory at the end of the process execution. For this reason files published by a process must not be accessed by other downstream processes.

    Assuming your inputFileChannel is intended to be a value channel, you could use the following. This requires each of the six processes to declare its TSV file in an output block (using the path qualifier). We can then mix and collect these files, and pass the resulting list, together with your Rmd file, to the createReport process. Note that if you move your Rmd into the base directory of your pipeline project (i.e. the same directory as your main.nf script), you can distribute it with your workflow. Providing the Rmd as a process input also ensures it is staged into the process working directory when the job runs. For example:

    workflow {
    
        inputFile = file( params.pathOfInputFile )
    
        findNumOfProteins( inputFile )
        findAminoAcidFrequency( inputFile )
        getProteinDescriptions( inputFile )
        getNumberOfLines( inputFile )
        getNumberOfLinesWithoutSpaces( inputFile )
        getLengthFreq( inputFile )
    
        Channel.empty() \
            | mix( findNumOfProteins.out ) \
            | mix( findAminoAcidFrequency.out ) \
            | mix( getProteinDescriptions.out ) \
            | mix( getNumberOfLines.out ) \
            | mix( getNumberOfLinesWithoutSpaces.out ) \
            | mix( getLengthFreq.out ) \
            | collect \
            | set { outputs }
    
        rmd = file("${baseDir}/createReport.Rmd")
    
        createReport( outputs, rmd )
    }
    
    process createReport {
    
        module 'R/4.2.2'
    
        publishDir params.outdir.output, mode: 'copy'
    
        input:
        path 'input_dir/*'
        path rmd
    
        output:
        path 'report.pdf'
    
        """
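        # Copy the staged symlink to a real file named report.Rmd so that
        # render() works in this directory and names its output report.pdf
        cp -L "${rmd}" report.Rmd
        Rscript -e "rmarkdown::render('report.Rmd')"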
        """
    }
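
    The workflow above only works if each of the six upstream processes declares its TSV file in an output block, since that is what makes the '.out' channels available. A minimal sketch of one such process follows. The shell command, the output file name 'num_proteins.tsv' and the assumption that the input is a FASTA file are all hypothetical; only the process name and params.outdir.main come from the question.

```groovy
// Hypothetical sketch of one upstream process. The command and the
// output file name are assumptions made for illustration only.
process findNumOfProteins {

    // Publishing is still fine for keeping a copy in 'results';
    // downstream processes just must not read from that folder.
    publishDir params.outdir.main, mode: 'copy'

    input:
    path input_file

    output:
    path 'num_proteins.tsv'   // this declaration is what findNumOfProteins.out emits

    script:
    """
    # Count FASTA header lines (assumes the input is a FASTA file)
    count=\$(grep -c '^>' "${input_file}" || true)
    printf 'NUM_PROTEINS\\n%s\\n' "\$count" > num_proteins.tsv
    """
}
```

    With an output block like this on each of the six processes, the mix and collect chain gathers the six files into a single list, and createReport runs once, after all of them have finished.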
    

    Note that the createReport process above will stage the input TSV files under a folder called 'input_dir' in the process working directory. You could change this if you want to, but I think this keeps the working directory neat and tidy. Just be sure to modify your Rmd script so that it points to this folder using a relative path, rather than the hard-coded absolute path to your 'results' directory. For example, you might choose to use something like:

    dataSet <- list.files(path="./input_dir", pattern="\\.tsv$")
    

    Or perhaps even:

    dataSet <- list.files(pattern="\\.tsv$", recursive=TRUE)