How to specify optional inputs for nextflow processes?


I'm new to Nextflow and have been trying to create a small pipeline for some Python scripts I have. However, I have run into an issue with optional process inputs that I can't find a workaround for. I'm also curious what the best practices are for optional inputs and parameters.

#!/usr/bin/env nextflow

params.out = ""
params.kml_1 = null
params.kml_2 = null
params.loc = ""
params.new_data_1 = false
params.new_data_2 = false

process getPolygons {
    input:
    tuple val(db_table), path(path_to_kml), val(new_data)
    val loc
    path path_to_outdir

    def new_data_arg = new_data ? "--new_data" : ""
    def kml_arg = (path_to_kml != null) ? "--kml $path_to_kml" : ""

    script:
    """
    python3 ${baseDir}/bin/polygon_data.py --loc $loc --db_table $db_table $kml_arg $new_data_arg --outdir $path_to_outdir
    """
}

workflow {
    outdir_ch = Channel.fromPath(params.out)
    location_ch = Channel.of(params.loc)

    tables = [
        tuple("Table1", params.kml_1 ? params.new_data_1 : null, params.new_data_1),
        tuple("Table2", params.kml_2 ? params.new_data_2: null, params.new_data_2)
    ]
    tables_ch = Channel.from(tables)

    getPolygons(tables_ch, location_ch, outdir_ch)
}

The code worked before I added the optional inputs. At that point tables was a plain list, rather than a list of tuples carrying the optional getPolygons inputs path_to_kml and new_data:

tables = ["Table1", "Table2"]

I keep running into the error

ERROR ~ No such variable: new_data or ERROR ~ No such variable: path_to_kml

depending on the order of creating the variables new_data_arg and kml_arg.

The tuple approach is my latest attempt at getting the optional parameters new_data and path_to_kml to work; previously they were separate inputs to getPolygons. Could the problem be that I create new_data_arg and kml_arg and use those in the script, instead of using new_data and path_to_kml directly? If so, I'm not sure what the workaround is, because I need to apply some logic to new_data and path_to_kml before passing them to polygon_data.py.


Solution

  • I found a solution that uses tuples. First, the ERROR ~ No such variable errors occurred because new_data_arg and kml_arg were defined outside the script block of the process, between input: and script:, where the input variables are not yet available (rookie mistake).

    Next, I realized that the process would not iterate over the tuples. I fixed that with the each qualifier, passing every tuple in as the variable tuple_info. I also used "" instead of null for path_to_kml, since it is a path and null could cause issues. This is the final working version of my process:

    process getPolygons {
        input:
        each tuple_info
        val loc
        path path_to_outdir
    
        script:
        // Unpack the tuple supplied by `each`
        def (db_table, path_to_kml, new_data) = tuple_info
        // Build the optional flags; an empty string or null yields no flag
        def new_data_arg = new_data ? "--new_data" : ""
        def kml_arg = path_to_kml ? "--kml $path_to_kml" : ""
    
        """
        python3 ${baseDir}/bin/polygon_data.py --loc $loc --db_table $db_table $kml_arg $new_data_arg --outdir $path_to_outdir
        """
    }
    

    I also realized that I could have simplified the tables list, as there's no reason to build extra logic around params.kml_1 and params.kml_2 when the parameter defaults already handle this.

    tables = [
        tuple("Table1", params.kml_1, params.new_data_1),
        tuple("Table2", params.kml_2, params.new_data_2)
    ]
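
    A possible alternative to each, offered as a sketch rather than a tested replacement: the process likely ran only once because location_ch and outdir_ch are queue channels holding a single item each, and a process runs only as many times as its shortest queue channel allows. Passing params.loc and file(params.out) as plain values gives the process value inputs that every task can reuse, so the tuple input can stay. The sketch assumes that params.kml_1 and params.kml_2 default to "", that --loc and --out are always supplied on the command line, and that the KML paths are absolute, because a val input is not staged into the task work directory.

    ```nextflow
    #!/usr/bin/env nextflow

    // Assumption: empty strings mark "no KML given"; --loc and --out come from the command line
    params.kml_1 = ""
    params.kml_2 = ""
    params.new_data_1 = false
    params.new_data_2 = false

    process getPolygons {
        input:
        // val (not path) for the KML, so an empty value is accepted
        tuple val(db_table), val(path_to_kml), val(new_data)
        val loc
        path path_to_outdir

        script:
        // Input variables are only visible from here on
        def new_data_arg = new_data ? "--new_data" : ""
        def kml_arg = path_to_kml ? "--kml $path_to_kml" : ""
        """
        python3 ${baseDir}/bin/polygon_data.py --loc $loc --db_table $db_table $kml_arg $new_data_arg --outdir $path_to_outdir
        """
    }

    workflow {
        tables = [
            tuple("Table1", params.kml_1, params.new_data_1),
            tuple("Table2", params.kml_2, params.new_data_2)
        ]
        // One channel item per table, so one task per table
        tables_ch = Channel.fromList(tables)

        // Plain values act as value inputs and are reused by every task
        getPolygons(tables_ch, params.loc, file(params.out))
    }
    ```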