Passing list of filenames to nextflow process


I am a newcomer to Nextflow and I am trying to process multiple files in a workflow. There are more than 300 of these files, so I would rather not paste them on the command line as an option. Instead, I've created a file that lists the name of every file I need to process, but I am not sure how to pass it into the process. This is what I've tried:

params.SRRs = "srr_ids.txt"


process tmp {
  input:
    file ids

  output:
    path  "*.txt"

  script:
    '''
    while read id; do
      touch ${id}.txt;
      echo ${id} > ${id}.txt;
    done < $ids
    '''
}

workflow {
  tmp(params.SRRs)
}

The script is supposed to read the file srr_ids.txt and create one file per id, each containing that id (I'm just testing on a smaller task). The error log says that the id variable is unbound, but I don't understand why. What is the conventional way of passing lots of filenames to a pipeline? Should I write some other process that parses the list?


Solution

  • Maybe there's a typo in your question, but the error is actually that the ids variable is unbound:

    Command error:
      .command.sh: line 5: ids: unbound variable
    

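    The problem is that Nextflow does not interpolate variables inside a single-quoted script string, so $ids reaches Bash unchanged. Bash has no variable of that name, and because Nextflow runs task scripts with bash -ue by default, expanding the unset variable is a fatal error. You can either define your script using a double-quoted string and escape your shell variables: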

    params.SRRs = "srr_ids.txt"
    
    
    process tmp {
      input:
        path ids
    
      output:
        path  "*.txt"
    
      script:
        """
        while read id; do
          touch "\${id}.txt"
          echo "\${id}" > "\${id}.txt"
        done < "${ids}"
        """
    }
    
    workflow {
    
      SRRs = file(params.SRRs)
    
      tmp(SRRs)
    }
    

    Or, use a shell block which uses the exclamation mark ! character as the variable placeholder for Nextflow variables. This makes it possible to use both Nextflow and shell variables in the same piece of code without having to escape each of the shell variables:

    params.SRRs = "srr_ids.txt"
    
    
    process tmp {
      input:
        path ids
    
      output:
        path  "*.txt"
    
      shell:
        '''
        while read id; do
          touch "${id}.txt"
          echo "${id}" > "${id}.txt"
        done < "!{ids}"
        '''
    }
    
    workflow {
    
      SRRs = file(params.SRRs)
    
      tmp(SRRs)
    }
    

    What is the conventional way of passing lots of filenames to a pipeline?

    The conventional way, I think, is to supply one (or more) glob patterns to the fromPath channel factory method. For example:

    params.SRRs = "./path/to/files/SRR*.fastq.gz"
    
    workflow {
    
      Channel
        .fromPath( params.SRRs )
        .view()
    }
    

    Results:

    $ nextflow run main.nf
    N E X T F L O W  ~  version 22.04.4
    Launching `main.nf` [sleepy_bernard] DSL2 - revision: 30020008a7
    /home/steve/working/stackoverflow/73702711/path/to/files/SRR1910483.fastq.gz
    /home/steve/working/stackoverflow/73702711/path/to/files/SRR1910482.fastq.gz
    /home/steve/working/stackoverflow/73702711/path/to/files/SRR1448795.fastq.gz
    /home/steve/working/stackoverflow/73702711/path/to/files/SRR1448793.fastq.gz
    /home/steve/working/stackoverflow/73702711/path/to/files/SRR1448794.fastq.gz
    /home/steve/working/stackoverflow/73702711/path/to/files/SRR1448792.fastq.gz
    

    If you would instead prefer to pass in a list of filenames, as in your example, use either the splitCsv or the splitText operator to emit one item per line of the list. For example:

    params.SRRs = "srr_ids.txt"
    
    workflow {
      
      Channel
        .fromPath( params.SRRs )
        .splitText() { it.strip() }
        .view()
    }
    

    Results:

    $ nextflow run main.nf 
    N E X T F L O W  ~  version 22.04.4
    Launching `main.nf` [fervent_ramanujan] DSL2 - revision: 89a1771d50
    SRR1448794
    SRR1448795
    SRR1448792
    SRR1448793
    SRR1910483
    SRR1910482
    

    Should I write some other process that parses the list?

    You may not need to. My feeling is that your code might benefit from using the fromSRA factory method, but the question doesn't give enough detail to say one way or the other. If you need to, you could just write a function that returns a channel.
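    One way such a function might look, as a sketch only: the name ids_from_list is made up for illustration, and it assumes the list file contains one ID per line:

    ```nextflow
    params.SRRs = "srr_ids.txt"

    // Hypothetical helper: turns a text file with one ID per line into a channel of IDs.
    def ids_from_list(list_file) {
      return Channel
        .fromPath( list_file, checkIfExists: true )
        .splitText() { it.strip() }
        .filter { it }   // drop blank lines
    }

    workflow {
      ids_from_list( params.SRRs ).view()
    }
    ```

    The returned channel can be passed to a process like any other, so no extra parsing process is needed.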