Tags: openmpi, openacc, ucx

How to enable CUDA-aware OpenMPI?


I'm using OpenMPI and I need to enable CUDA-aware MPI. Together with MPI I'm using OpenACC from the NVIDIA HPC SDK.

Following https://www.open-mpi.org/faq/?category=buildcuda, I downloaded and installed UCX (but not gdrcopy, which I haven't managed to install) with:

./contrib/configure-release --with-cuda=/opt/nvidia/hpc_sdk/Linux_x86_64/20.7/cuda/11.0 CC=pgcc CXX=pgc++ --disable-fortran

and it prints:

checking cuda.h usability... yes
checking cuda.h presence... yes
checking for cuda.h... yes
checking cuda_runtime.h usability... yes
checking cuda_runtime.h presence... yes
checking for cuda_runtime.h... yes
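As a sanity check that CUDA really made it into the UCX build (assuming the install's bin directory is on PATH), ucx_info can be queried:

ucx_info -v
ucx_info -d | grep -i cuda

The first prints the configure flags the build was made with; the second should list CUDA transports such as cuda_copy.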

So UCX seems to be OK. After this I reconfigured OpenMPI with:

./configure --with-ucx=/home/marco/Downloads/ucx-1.9.0/install --with-cuda=/opt/nvidia/hpc_sdk/Linux_x86_64/20.7/cuda/11.0 CC=pgcc CXX=pgc++ --disable-mpi-fortran

and it prints:

CUDA support: yes
Open UCX: yes
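
Beyond the configure summary, CUDA awareness can also be probed at run time. Open MPI ships the extension macro MPIX_CUDA_AWARE_SUPPORT and the function MPIX_Query_cuda_support() in its mpi-ext.h header; a minimal sketch (assuming Open MPI >= 2.0, compiled with mpicc):

#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>  /* Open MPI-specific extensions */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    /* Compiled-in support; also ask the library at run time. */
    printf("run-time CUDA support: %s\n",
           MPIX_Query_cuda_support() ? "yes" : "no");
#else
    printf("this MPI build has no CUDA support\n");
#endif
    MPI_Finalize();
    return 0;
}

Alternatively, ompi_info --parsable --all | grep mpi_built_with_cuda_support:value should report true.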

If I try to run the application with mpirun -np 2 -mca pml ucx -x ./a.out (as suggested on openucx.org), I get these errors:

match_arg (utils/args/args.c:163): unrecognized argument mca
HYDU_parse_array (utils/args/args.c:178): argument matching returned error
parse_args (ui/mpich/utils.c:1642): error parsing input array
HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1694): unable to parse user arguments
main (ui/mpich/mpiexec.c:148): error parsing parameters

I see that the directories the compiler is looking in are not the OpenMPI ones but MPICH's, and I don't know why. If I type which mpicc, which mpiexec, and which mpirun, I get the OpenMPI ones.
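
Those match_arg/HYDU_parse_array messages are printed by MPICH's Hydra launcher, so the mpirun actually being executed is MPICH's even if which points elsewhere (a shell alias or bash's hashed command path can cause this). A quick check:

mpirun --version   # Open MPI prints "mpirun (Open MPI) x.y.z"; MPICH prints "HYDRA build details"
hash -r            # drop bash's cached command locations if the wrong one shows up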

If I run with mpiexec -n 2 ./a.out I get:

YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)

EDIT:

Doing the same but using the OpenMPI 4.0.5 that comes with the NVIDIA HPC SDK, it compiles fine, but at run time I get:

[marco-Inspiron-7501:1356251:0:1356251] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f05cfafa000)
==== backtrace (tid:1356251) ====
 0  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(ucs_handle_error+0x67) [0x7f060ae06dc7]
 1  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(+0x2ab87) [0x7f060ae06b87]
 2  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(+0x2ace4) [0x7f060ae06ce4]
 3  /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f060c7433c0]
 4  /lib/x86_64-linux-gnu/libc.so.6(+0x18e885) [0x7f060befb885]
 5  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x379e6) [0x7f060b2bd9e6]
 6  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(ucp_dt_pack+0xa5) [0x7f060b2bd775]
 7  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4d5b5) [0x7f060b2d35b5]
 8  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4b11d) [0x7f060b2d111d]
 9  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libuct.so.0(+0x1b577) [0x7f060b055577]
10  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x75) [0x7f060b054725]
11  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4d614) [0x7f060b2d3614]
12  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4c2c7) [0x7f060b2d22c7]
13  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4b5b1) [0x7f060b2d15b1]
14  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x625bd) [0x7f060b2e85bd]
15  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x61d15) [0x7f060b2e7d15]
16  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x6121a) [0x7f060b2e721a]
17  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(ucp_tag_send_nbx+0x5ec) [0x7f060b2e65ac]
18  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libmpi.so.40(mca_pml_ucx_send+0x1a3) [0x7f060dfc3b33]
=================================
[marco-Inspiron-7501:1356252:0:1356252] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fd7f7afa000)
==== backtrace (tid:1356252) ====
 0  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(ucs_handle_error+0x67) [0x7fd82a711dc7]
 1  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(+0x2ab87) [0x7fd82a711b87]
 2  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucs.so.0(+0x2ace4) [0x7fd82a711ce4]
 3  /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7fd82c04e3c0]
 4  /lib/x86_64-linux-gnu/libc.so.6(+0x18e885) [0x7fd82b806885]
 5  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x379e6) [0x7fd82abc89e6]
 6  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(ucp_dt_pack+0xa5) [0x7fd82abc8775]
 7  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4d5b5) [0x7fd82abde5b5]
 8  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4b11d) [0x7fd82abdc11d]
 9  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libuct.so.0(+0x1b577) [0x7fd82a960577]
10  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x75) [0x7fd82a95f725]
11  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4d614) [0x7fd82abde614]
12  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4c2c7) [0x7fd82abdd2c7]
13  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x4b5b1) [0x7fd82abdc5b1]
14  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x625bd) [0x7fd82abf35bd]
15  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x61d15) [0x7fd82abf2d15]
16  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(+0x6121a) [0x7fd82abf221a]
17  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libucp.so.0(ucp_tag_send_nbx+0x5ec) [0x7fd82abf15ac]
18  /opt/nvidia/hpc_sdk_209/Linux_x86_64/20.9/comm_libs/openmpi4/openmpi-4.0.5/lib/libmpi.so.40(mca_pml_ucx_send+0x1a3) [0x7fd82d8ceb33]
=================================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node marco-Inspiron-7501 exited on signal 11 (Segmentation fault).

The error is caused by the #pragma acc host_data use_device(send_buf, recv_buf) directive:

  double send_buf[NX_GLOB + 2*NGHOST];
  double recv_buf[NX_GLOB + 2*NGHOST];

  /* Create device copies of both halo buffers. */
  #pragma acc enter data create(send_buf[:NX_GLOB+2*NGHOST], recv_buf[:NX_GLOB+2*NGHOST])

  // Top buffer: pack the j = jend row of phi into send_buf on the device
  j = jend;
  #pragma acc parallel loop present(phi[:ny_tot][:nx_tot], send_buf[:NX_GLOB+2*NGHOST])
  for (i = ibeg; i <= iend; i++) send_buf[i] = phi[j][i];

  // Hand MPI the *device* addresses of the buffers (requires CUDA-aware MPI)
  #pragma acc host_data use_device(send_buf, recv_buf)
  {
    MPI_Sendrecv (send_buf, iend+1, MPI_DOUBLE, procR[1], 0,
                  recv_buf, iend+1, MPI_DOUBLE, procR[1], 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }
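
For what it's worth, a common fallback (not from the original post) that sidesteps CUDA-aware MPI entirely is to stage the halo exchange through host memory with acc update; slower, but it works with any MPI build:

  #pragma acc update host(send_buf[:NX_GLOB+2*NGHOST])    // copy packed row to the host
  MPI_Sendrecv (send_buf, iend+1, MPI_DOUBLE, procR[1], 0,
                recv_buf, iend+1, MPI_DOUBLE, procR[1], 0,
                MPI_COMM_WORLD, MPI_STATUS_IGNORE);        // plain host-pointer MPI
  #pragma acc update device(recv_buf[:NX_GLOB+2*NGHOST])   // push received row back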

Solution

  • This was an issue in the 20.7 release when adding UCX support. You can lower the optimization level to -O1 to work around the problem, or update your NV HPC compiler version to 20.9, where we've resolved the issue (a compile sketch follows the link below).

    https://developer.nvidia.com/nvidia-hpc-sdk-version-209-downloads
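
    A hedged example of the -O1 workaround (the file name laplace_acc.c is hypothetical; -acc enables OpenACC in the NVHPC compilers):

    mpicc -acc -O1 laplace_acc.c -o a.out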