
Error importing python modules in nextflow script block


I have a similar problem to those described here and here. The code is as follows:


    process q2_predict_dysbiosis {
    publishDir 'results', mode: 'copy'
    
    input:
    path abundance_file
    path species_abundance_file
    path stratified_pathways_table
    path unstratified_pathways_table
    
    output:
    path "${abundance_file.baseName}_q2pd.tsv"
    
    script:
    """
    #!/usr/bin/env python
    
    from q2_predict_dysbiosis import calculate_index
    import pandas as pd
    
    pd.set_option('display.max_rows', None)
    
    taxa = pd.read_csv("${species_abundance_file}", sep="\\t", index_col=0)
    paths_strat = pd.read_csv("${stratified_pathways_table}", sep="\\t", index_col=0)
    paths_unstrat = pd.read_csv("${unstratified_pathways_table}", sep="\\t", index_col=0)
    
    score_df = calculate_index(taxa, paths_strat, paths_unstrat)
    score_df.to_csv("${abundance_file.baseName}_q2pd.tsv", sep="\\t", float_format="%.2f")
    """
    }

Obtained error:

Caused by:
  Process `q2_predict_dysbiosis (1)` terminated with an error exit status (1)


Command executed:

  #!/usr/bin/env python

  from q2_predict_dysbiosis import calculate_index
  import pandas as pd

  pd.set_option('display.max_rows', None)

  taxa = pd.read_csv("abundance1-taxonomy_table.txt", sep="\t", index_col=0)
  paths_strat = pd.read_csv("pathways_stratified.txt", sep="\t", index_col=0)
  paths_unstrat = pd.read_csv("pathways_unstratified.txt", sep="\t", index_col=0)

  score_df = calculate_index(taxa, paths_strat, paths_unstrat)
  score_df.to_csv("abundance1_q2pd.tsv", sep="\t", float_format="%.2f")

Command exit status:
  1

Command output:
  (empty)

Command error:
  Traceback (most recent call last):
    File ".command.sh", line 3, in <module>
      from q2_predict_dysbiosis import calculate_index
  ModuleNotFoundError: No module named 'q2_predict_dysbiosis'

I have followed the instructions in this link, but it still doesn't work. I would like to keep the code block like that, and not run a script.py file. I am using the code from this repository.

Thanks in advance!

UPDATE

To try to resolve the import error, I have tried the following:

  1. Creating a bin/ directory in the same directory as script.nf. No change.

  2. Changing the shebang line. No change.

q2_predict_dysbiosis is not installed (it has no installation instructions), but it runs locally. I think the problem is that Nextflow cannot locate q2_predict_dysbiosis.py, even though it is in the ./bin directory.


Solution

  • Python resolves imports by searching the directories listed in sys.path, in order. That list is built from the following sources:

    1. The directory containing the script being run (or the current working directory when running interactively or with python -c). In a Nextflow script block the script is .command.sh in the task work directory, so this entry is the work directory, not your project directory.

    2. The PYTHONPATH environment variable: If set, the directories it lists are added next.

    3. The standard library and the site-packages directory of the system-wide installation or the active virtual environment.

    You can also modify sys.path within your code to include additional directories.
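
To see where a given interpreter will actually look, the search path can be printed from a fresh child process. This is a minimal sketch using only the standard library. `child_sys_path` is a hypothetical helper name, and the directory passed to it is a placeholder for wherever q2_predict_dysbiosis.py lives:

```python
import json
import os
import subprocess
import sys


def child_sys_path(extra_dir):
    """Return sys.path as seen by a fresh interpreter started with PYTHONPATH=extra_dir."""
    # Copy the current environment and override PYTHONPATH for the child only.
    env = dict(os.environ, PYTHONPATH=extra_dir)
    result = subprocess.run(
        [sys.executable, "-c", "import json, sys; print(json.dumps(sys.path))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # "/path/to/project/packages" is a placeholder directory.
    for entry in child_sys_path("/path/to/project/packages"):
        print(repr(entry))
```

If the directory holding the module does not show up in this list, the import will fail with ModuleNotFoundError as in the question.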


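    Nextflow adds your project's bin directory to PATH for each task, which makes executables in it runnable, but PATH has no effect on where Python looks for modules. That is why placing q2_predict_dysbiosis.py in bin/ made no difference. A quick solution is to set the PYTHONPATH environment variable using the env scope in your nextflow.config. For example, with q2_predict_dysbiosis.py in a folder called packages in the root directory of your project repository (i.e. the directory where the main.nf script is located):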

    env {
    
        PYTHONPATH = "${projectDir}/packages"
    }
    

    Tested using main.nf:

    process q2_predict_dysbiosis {
    
        debug true
    
        script:
        """
        #!/usr/bin/env python
        import sys
        print(sys.path)
    
        from q2_predict_dysbiosis import calculate_index
    
        assert 'q2_predict_dysbiosis' in sys.modules
        """
    }
    
    workflow {
    
        q2_predict_dysbiosis()
    }
    

    Results:

    $ nextflow run main.nf 
    
     N E X T F L O W   ~  version 24.10.0
    
    Launching `main.nf` [grave_avogadro] DSL2 - revision: 2f0c31286e
    
    executor >  local (1)
    [8f/50976f] q2_predict_dysbiosis [100%] 1 of 1 ✔
    [
        '/path/to/project/work/8f/50976fe453d54fd6e11b3501d4b05a',
        '/path/to/project/packages',
        '/usr/lib/python312.zip',
        '/usr/lib/python3.12',
        '/usr/lib/python3.12/lib-dynload',
        '/usr/lib/python3.12/site-packages'
    ]
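
If editing nextflow.config is not an option, a similar effect can be had from inside the script itself by extending sys.path before the import. The following is a sketch under stated assumptions: `import_from_dir` is a hypothetical helper, and in a Nextflow script block the directory would normally be interpolated from the pipeline (for example from projectDir) rather than hardcoded.

```python
import importlib
import sys


def import_from_dir(module_name, directory):
    """Import module_name after putting directory at the front of sys.path."""
    if directory not in sys.path:
        sys.path.insert(0, directory)
    # Make sure the import system notices files created after startup.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)
```

Usage would look like `calculate_index = import_from_dir("q2_predict_dysbiosis", "/path/to/project/packages").calculate_index`, where the path is a placeholder.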
    

    A better solution, though, is to refactor. Move your custom code into a separate file (e.g. your_script.py), place it in your bin directory and make it executable (chmod a+x bin/your_script.py). Also move q2_predict_dysbiosis.py into this directory or into a sub-directory called utils. I use the latter in my example below. This works without PYTHONPATH because Python puts the directory containing the executed script (here bin/) first on sys.path, so utils/ is importable from your_script.py. Your directory structure might look like:

    $ find .
    .
    ./main.nf
    ./bin
    ./bin/utils
    ./bin/utils/q2_predict_dysbiosis.py
    ./bin/your_script.py
    

    your_script.py might then look like the following, using argparse to provide a user-friendly command-line interface:

    #!/usr/bin/env python
    
    import argparse
    import pandas as pd
    
    from utils.q2_predict_dysbiosis import calculate_index
    
    pd.set_option('display.max_rows', None)
    
    def custom_help_formatter(prog):
        return argparse.HelpFormatter(prog, max_help_position=80)
    
    def parse_args():
        parser = argparse.ArgumentParser(
            description="Calculate dysbiosis index using abundance and pathways tables.",
            formatter_class=custom_help_formatter,
        )
    
        parser.add_argument(
            "species_abundance_file",
            help="Path to the species abundance file",
        )
        parser.add_argument(
            "stratified_pathways_table",
            help="Path to the stratified pathways table file",
        )
        parser.add_argument(
            "unstratified_pathways_table",
            help="Path to the unstratified pathways table file",
        )
        parser.add_argument(
            "output_file",
            help="Path to the output file to save the results",
        )
    
        return parser.parse_args()
    
    def main(
        species_abundance_file,
        stratified_pathways_table,
        unstratified_pathways_table,
        output_file
    ):
        taxa = pd.read_csv(species_abundance_file, sep="\t", index_col=0)
        paths_strat = pd.read_csv(stratified_pathways_table, sep="\t", index_col=0)
        paths_unstrat = pd.read_csv(unstratified_pathways_table, sep="\t", index_col=0)
        
        score_df = calculate_index(taxa, paths_strat, paths_unstrat)
        score_df.to_csv(output_file, sep="\t", float_format="%.2f")
    
    if __name__ == "__main__":
        args = parse_args()
    
        main(
            args.species_abundance_file,
            args.stratified_pathways_table,
            args.unstratified_pathways_table,
            args.output_file,
        )
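
The order of the positional arguments matters when the script is called from a process, so it can help to exercise the parser on an explicit argument list before wiring it into the pipeline. A minimal sketch that does not need the real q2_predict_dysbiosis module: `build_parser` is a hypothetical helper mirroring the four arguments above, and the file names are placeholders.

```python
import argparse


def build_parser():
    """Build the same four positional arguments as your_script.py (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Calculate dysbiosis index using abundance and pathways tables.",
    )
    for name in (
        "species_abundance_file",
        "stratified_pathways_table",
        "unstratified_pathways_table",
        "output_file",
    ):
        parser.add_argument(name)
    return parser


# Placeholder file names, given in the order the script expects them.
example_args = build_parser().parse_args(
    ["taxa.tsv", "strat.tsv", "unstrat.tsv", "out_q2pd.tsv"]
)
print(example_args.output_file)
```

Passing a list to parse_args instead of relying on sys.argv keeps the check independent of how the script is launched.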
    

    Tested using main.nf:

    $ cat main.nf 
    process q2_predict_dysbiosis {
    
        debug true
    
        script:
        """
        your_script.py --help
        """
    }
    
    workflow {
    
        q2_predict_dysbiosis()
    }
    
    

    Results:

    $ nextflow run main.nf 
    
     N E X T F L O W   ~  version 24.10.0
    
    Launching `main.nf` [peaceful_stonebraker] DSL2 - revision: fea21868c7
    
    executor >  local (1)
    [88/538f31] q2_predict_dysbiosis [100%] 1 of 1 ✔
    usage: your_script.py [-h] species_abundance_file stratified_pathways_table unstratified_pathways_table output_file
    
    Calculate dysbiosis index using abundance and pathways tables.
    
    positional arguments:
      species_abundance_file       Path to the species abundance file
      stratified_pathways_table    Path to the stratified pathways table file
      unstratified_pathways_table  Path to the unstratified pathways table file
      output_file                  Path to the output file to save the results
    
    options:
      -h, --help                   show this help message and exit
    
    

    If your dependencies also require certain local files to run, place the required files into a sub-directory in your project repository. Pass that directory into the workflow (e.g. using data_dir = "${projectDir}/data" in the workflow block) and add a matching entry to the input block of each process that needs it.

    If the names of the input files are hardcoded in your Python script, supply a string value to path (e.g. path 'data') so that Nextflow stages the files under exactly that name. Once the files are staged in the process working directory, Python should be able to find them, provided the paths in your Python script are relative. If they are absolute, you will need to make them relative. A minimal example might look like:

    process test_proc {
    
        debug true
    
        input:
        path 'data'
    
        script:
        """
        ls -1 data/{foo,bar,baz}.txt
        """
    }
    
    workflow {
    
        data_dir = "${projectDir}/data"
    
        test_proc( data_dir )
    }

    Results:

    $ mkdir data
    $ touch data/{foo,bar,baz}.txt
    $ nextflow run main.nf 
    
     N E X T F L O W   ~  version 24.10.0
    
    Launching `main.nf` [prickly_nightingale] DSL2 - revision: 83d939e180
    
    executor >  local (1)
    [ec/bf1f56] process > test_proc [100%] 1 of 1 ✔
    data/bar.txt
    data/baz.txt
    data/foo.txt
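
On the Python side, the matching pattern is to resolve the hardcoded names against the current working directory, which inside a task is the process work directory. A small sketch, assuming the staged directory is called data as above; `missing_files` is a hypothetical helper for failing early with a clear message:

```python
from pathlib import Path


def missing_files(data_dir, names):
    """Return the names that do not exist under data_dir (resolved against the cwd)."""
    base = Path(data_dir)  # relative path, so it follows the task work directory
    return [name for name in names if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("data", ["foo.txt", "bar.txt", "baz.txt"])
    if missing:
        print("Missing staged files:", ", ".join(missing))
```

Because the path is relative, the same code works unchanged in every task directory that Nextflow creates.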