Nextflow Execution Environment Differs Between Processes


I am defining two Nextflow processes. The first one, scatter(), creates two files. Then parallel() is spawned twice, once for each file.

Here is my setup.

// bug.nf
nextflow.enable.dsl = 2

workflow {
    main:
        scatter(params.config)

        scatter.out.configs
            | flatten
            | parallel
}

process scatter {
    container "python:3.11.8"

    input:
        path "config.txt"

    output:
        path "config*.txt", emit: configs

    script:
        """
        echo $PWD
        ls -hal /home/alex/my_cool_repo

        touch config1.txt
        touch config2.txt
        """
}

process parallel {
    container "python:3.11.8"
    
    input:
        path "config.txt"

    script:
        """
        echo $PWD
        ls -hal /home/alex/my_cool_repo
        """
}
// run command
nextflow run nextflow/bug.nf --config /home/alex/my_cool_repo/my_cool_repo/config/bla.txt

The ls output from all processes should look the same but it does not.

Output from scatter() (truncated):

/home/alex/my_cool_repo
total 656K
drwxrwxr-x 16 1035 1036 4.0K Feb 17 13:20 .
drwxr-xr-x  3 root root 4.0K Feb 17 13:20 ..
-rw-rw-r--  1 1035 1036 3.3K Feb 17 11:09 .dockerignore
-rw-rw-r--  1 1035 1036 3.2K Feb  6 15:33 .gitignore
drwxrwxr-x  4 1035 1036 4.0K Feb 17 13:20 .nextflow
-rw-rw-r--  1 1035 1036 5.4K Feb 17 13:20 .nextflow.log
-rw-rw-r--  1 1035 1036    5 Jan 26 18:18 .python-version
drwxrwxr-x  6 1035 1036 4.0K Feb  7 14:20 .venv
drwxrwxr-x  2 1035 1036 4.0K Feb  6 13:28 .vscode
-rw-rw-r--  1 1035 1036  848 Feb 17 12:28 Dockerfile
-rw-rw-r--  1 1035 1036  627 Feb  6 15:33 README.md
drwxrwxr-x  3 1035 1036 4.0K Feb 17 12:55 nextflow
-rw-rw-r--  1 1035 1036 527K Feb 17 11:45 poetry.lock
-rw-rw-r--  1 1035 1036   32 Jan 26 18:18 poetry.toml
-rw-rw-r--  1 1035 1036 2.2K Feb 16 19:36 pyproject.toml
drwxrwxr-x  9 1035 1036 4.0K Feb  6 13:28 my_cool_repo
drwxrwxr-x  3 1035 1036 4.0K Feb 17 13:20 work

Output from the two parallel() processes:

/home/alex/my_cool_repo
total 12K
drwxr-xr-x 3 root root 4.0K Feb 17 13:20 .
drwxr-xr-x 3 root root 4.0K Feb 17 13:20 ..
drwxrwxr-x 5 1035 1036 4.0K Feb 17 13:20 work

Why are the outputs not the same?

Context: Instead of ls I actually would like to run poetry run ... but poetry gives the following error message for the parallel() processes: Poetry could not find a pyproject.toml file in /home/alex/my_cool_repo/work/f3/766313fbc5d6aeeb39f19193956ffd or its parents.


Solution

  • As user dbthorbur points out in his comment, the difference has to do with the directories mounted into your container.

    Your first process, scatter, takes an additional input file that is located somewhere else on your machine, outside the work directory. Nextflow therefore needs to mount both that location and your work directory into the container used for scatter. Apparently it mounts a common parent directory of the two (here /home/alex/my_cool_repo), which is why you see the additional files.

    The second process, parallel, only takes input from work, so only that directory gets mounted as a volume in its container.

    Check the .command.run scripts in the task work directories to see what actually gets mounted by Docker (or Podman, if that is what you are using).

    There are two ways to overcome the difference.

    • Use the stageInMode "copy" directive in scatter to get the behaviour of parallel in both processes, or
    • use the containerOptions "-v /home/alex/my_cool_repo:/home/alex/my_cool_repo" directive in parallel to get the current behaviour of scatter in both.
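
    As a rough sketch, the two options would look like this when applied to the processes from the question. Only the two added directives are new; everything else is copied from bug.nf. You would normally pick one option, not both. This assumes Docker is the container engine and that the repository lives at /home/alex/my_cool_repo, as in the question.

    ```nextflow
    // Option 1: copy the staged input into the task directory instead of
    // symlinking it, so scatter should no longer need the repo mounted
    process scatter {
        container "python:3.11.8"
        stageInMode "copy"

        input:
            path "config.txt"

        output:
            path "config*.txt", emit: configs

        script:
            """
            touch config1.txt
            touch config2.txt
            """
    }

    // Option 2: explicitly mount the repository into the parallel container
    // so that tools such as poetry can find pyproject.toml in a parent directory
    process parallel {
        container "python:3.11.8"
        containerOptions "-v /home/alex/my_cool_repo:/home/alex/my_cool_repo"

        input:
            path "config.txt"

        script:
            """
            ls -hal /home/alex/my_cool_repo
            """
    }
    ```

    For the poetry use case described in the question, option 2 is probably the relevant one, since poetry needs to see pyproject.toml from inside the container. After changing either process, the .command.run file in the task's work directory should show whether the mounts are what you expect.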