bash, parallel-processing, gnu, pbs, cray

How to use GNU parallel (bash scripting) with the aprun command on Cray XE6 compute nodes (Unix-like env)?


I am trying to run 16 instances of an mpi4py Python script, hello.py. I stored 16 commands of this sort in s.txt:

python /lustre/4_mpi4py/hello.py > 01.out

I am submitting this on a Cray cluster via the aprun command like this:

aprun -n 32 sh -c 'parallel -j 8 :::: s.txt'

My intention was to run 8 of those Python jobs per node at a time. The script ran for more than 3 hours and none of the *.out files were created. From the PBS scheduler output file I am getting this:

Python version 2.7.3 loaded
aprun: Apid 11432669: Caught signal Terminated, sending to application
aprun: Apid 11432669: Caught signal Terminated, sending to application
parallel: SIGTERM received. No new jobs will be started.
parallel: SIGTERM received. No new jobs will be started.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: SIGTERM received. No new jobs will be started.
parallel: SIGTERM received. No new jobs will be started.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 07.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 07.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 02.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 04.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 06.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 01.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 01.out
parallel: SIGTERM received. No new jobs will be started.
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 10.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 04.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 08.out
parallel: SIGTERM received. No new jobs will be started.
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out

I am running this on one node, which has 32 cores. I suppose my use of the GNU parallel command is wrong. Can someone please help with this?


Solution

  • As listed in https://portal.tacc.utexas.edu/documents/13601/1102030/4_mpi4py.pdf#page=8

    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    print "Hello! I'm rank %02d from %02d" % (comm.rank, comm.size)

    print "Hello! I'm rank %02d from %02d" % (comm.Get_rank(), comm.Get_size())

    print "Hello! I'm rank %02d from %02d" % (MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.Get_size())
    

    Your 4_mpi4py/hello.py program is not a typical single-process program (or plain Python script), but a multi-process MPI application.

    GNU parallel expects simpler programs and doesn't support interaction with MPI processes.

    In your cluster there are many nodes, and every node may start a different number of MPI processes (with two 8-core CPUs per node, think about the variants: 2 MPI processes with 8 OpenMP threads each; 1 MPI process with 16 threads; 16 MPI processes without threads). To describe the slice of the cluster given to your task, there is an interface between the cluster-management software and the MPI library used by the Python MPI wrapper your script imports. That management layer is aprun (and qsub?):

    http://www.nersc.gov/users/computational-systems/retired-systems/hopper/running-jobs/aprun/aprun-man-page/

    https://www.nersc.gov/users/computational-systems/retired-systems/hopper/running-jobs/aprun/

    You must use the aprun command to launch jobs on the Hopper compute nodes. Use it for serial, MPI, OpenMP, UPC, and hybrid MPI/OpenMP or hybrid MPI/CAF jobs.

    https://wickie.hlrs.de/platforms/index.php/CRAY_XE6_Using_the_Batch_System

    The job launcher for the XE6 parallel jobs (both MPI and OpenMP) is aprun. ... The aprun example above will start the parallel executable "my_mpi_executable" with the arguments "arg1" and "arg2". The job will be started using 64 MPI processes with 32 processes placed on each of your allocated nodes (remember that a node consists of 32 cores in the XE6 system). You need to have nodes allocated by the batch system before (qsub).

    There is an interface between aprun, qsub, and MPI: in a normal start (aprun -n 32 python /lustre/4_mpi4py/hello.py), aprun just starts several (32) processes of your MPI program, sets the id of each process in that interface, and gives them a group id (for example, with environment variables like PMI_ID; the actual variables are specific to the launcher/MPI library combination).
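    You can check what the launcher actually hands to each process by printing the launcher-related environment variables from inside the job. A minimal sketch (the PMI_/ALPS_/CRAY_/OMPI_ prefixes are only guesses; which variables actually appear depends on your launcher and MPI library):

    # inspect_launcher_env.py -- run it under the launcher, e.g.: aprun -n 4 python inspect_launcher_env.py
    # Prints the environment variables the launcher set for this process.
    # The prefixes below are assumptions; adjust them to your launcher/MPI combination.
    import os

    prefixes = ("PMI_", "ALPS_", "CRAY_", "OMPI_")
    for name in sorted(os.environ):
        if name.startswith(prefixes):
            print("%s=%s" % (name, os.environ[name]))

    Started directly under aprun, each rank sees its own values; processes forked by parallel inside one rank simply inherit that rank's environment, which is exactly the duplication described above.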

    GNU parallel has no interface to MPI programs; it knows nothing about such variables. It will just start 8 times more processes than expected. All 32 * 8 processes in your incorrect command will have the same group id, and there will be 8 processes with the same MPI process id. They will make your MPI library misbehave.

    Never mix MPI resource managers / launchers with ancient, pre-MPI Unix process forkers like xargs or parallel, or with "very advanced bash scripting for parallelism". There is MPI for doing something in parallel, and there is the MPI launcher / job manager (aprun, mpirun, mpiexec) for starting several processes / forking / ssh-ing to machines.

    Don't do aprun -n 32 sh -c 'parallel anything_with_MPI'; this is an unsupported combination. The only possible (allowed) argument to aprun is a program with some supported kind of parallelism such as OpenMP, MPI, MPI+OpenMP, or a non-parallel program (or a single script that starts ONE parallel program).

    If you have several independent MPI tasks to start, pass several command groups to aprun, separated by colons: aprun -n 8 ./program_to_process_file1 : -n 8 ./program_to_process_file2 : -n 8 ./program_to_process_file3 : -n 8 ./program_to_process_file4

    If you have multiple files to work on, try to start many parallel jobs: use not a single qsub, but several, and allow PBS (or whichever job manager is used) to manage your jobs.
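    A hedged sketch of that approach, assuming a Torque/PBS-style qsub that accepts a job script; the #PBS resource lines, the core counts, and the paths are placeholders to adapt to your site:

    # submit_many.py -- write one small PBS job per output file and submit it with qsub.
    # The #PBS directives, core counts, and paths are assumptions for illustration only.
    import subprocess

    SCRIPT = "/lustre/4_mpi4py/hello.py"

    for i in range(1, 17):
        jobfile = "job_%02d.pbs" % i
        with open(jobfile, "w") as f:
            f.write("#!/bin/bash\n")
            f.write("#PBS -l mppwidth=1\n")        # cores requested; syntax is site-specific
            f.write("#PBS -l walltime=00:10:00\n")
            f.write("cd $PBS_O_WORKDIR\n")
            f.write("aprun -n 1 python %s > %02d.out\n" % (SCRIPT, i))
        subprocess.call(["qsub", jobfile])         # let PBS schedule all 16 jobs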

    If you have a very high number of files, try not to use MPI in your program (don't link MPI libs / include MPI headers) and use parallel or another form of ancient, pre-MPI parallelism that is hidden from aprun. Or use a single MPI program and implement the file distribution directly in your code (the master MPI process may open the file list, then distribute the files between the other MPI processes), with or without MPI's dynamic process management / mpi4py: http://pythonhosted.org/mpi4py/usrman/tutorial.html#dynamic-process-management.
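    A minimal sketch of that master/worker file distribution with mpi4py (the file-list name and the process_file body are hypothetical placeholders):

    # distribute_files.py -- a single MPI job, started e.g. as: aprun -n 8 python distribute_files.py
    # Rank 0 reads the file list and scatters one slice of it to every rank (including itself).
    from mpi4py import MPI

    def process_file(path):
        # placeholder for the real per-file work
        print("rank %d processing %s" % (MPI.COMM_WORLD.Get_rank(), path))

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    if rank == 0:
        with open("file_list.txt") as f:                  # one path per line (assumed name)
            files = [line.strip() for line in f if line.strip()]
        chunks = [files[i::size] for i in range(size)]    # round-robin split into `size` slices
    else:
        chunks = None

    my_files = comm.scatter(chunks, root=0)               # each rank receives its own slice
    for path in my_files:
        process_file(path)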

    Some scientists try to combine MPI and parallel in the other order: parallel ... aprun ... or parallel ... mpirun ...: