Long cuMemHostAlloc call after exiting a kernel with copyout

I am accelerating a Fortran code with OpenACC. When I profiled the program with NVIDIA Nsight, I noticed that the first call of a kernel with a copyout clause exhibits a long call to cuMemHostAlloc.

Here is a trivial example illustrating this. The program launches the same kernel 10 times in succession; each launch fills an array test and copies it back to the host:

program test
  implicit none
  real, allocatable :: test(:)
  integer :: i, j, n, m

  n = 1000
  m = 10
  allocate(test(n))

  do j = 1, m
    !$acc kernels copyout(test)
    !$acc loop independent
    do i = 1, n
      test(i) = real(i)
    end do
    !$acc end kernels
  end do

  deallocate(test)
end program test

The code is compiled with NVHPC 22.7 without any optimization flag (adding such flags made no difference). The profiling of the code gives:

Compared to the actual memory transfer time, as seen for the 9 other calls, the call to cuMemHostAlloc is ridiculously long. If I remove the copyout clause, the call to cuMemHostAlloc disappears, so it is related to copying data back from the device, but I do not understand why it happens only once and takes so long. Also, the test array is already allocated in host memory. Am I missing something?

Solution

It's the call that creates the pinned memory buffers used to transfer data between the host and the device. DMA transfers must use non-swappable (i.e. pinned) memory.

We use a double-buffering system: while one buffer is being filled from virtual memory, the second buffer is transferred asynchronously to the device, effectively hiding much of the virtual-to-pinned memory copy.
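To make the double-buffering idea concrete, here is a purely conceptual, host-only sketch. It is not the runtime's actual implementation: the names staging, chunk and slot are hypothetical, dst merely stands in for device memory, and the two steps run one after the other here, whereas the real runtime overlaps them using asynchronous DMA.

```fortran
program staged_copy_sketch
  implicit none
  integer, parameter :: n = 1000      ! same array size as in the question
  integer, parameter :: chunk = 256   ! hypothetical staging-buffer size
  real :: src(n), dst(n)              ! dst stands in for device memory
  real :: staging(chunk, 2)           ! stands in for the two pinned buffers
  integer :: i, first, last, width, slot

  do i = 1, n
    src(i) = real(i)
  end do
  dst = 0.0

  slot = 1
  do first = 1, n, chunk
    last = min(first + chunk - 1, n)
    width = last - first + 1
    ! Step 1: copy pageable host data into the current staging buffer.
    staging(1:width, slot) = src(first:last)
    ! Step 2: "transfer" the staged chunk. In a real runtime this would be
    ! an asynchronous DMA copy overlapping with the filling of the other buffer.
    dst(first:last) = staging(1:width, slot)
    ! Alternate between buffer 1 and buffer 2.
    slot = 3 - slot
  end do

  ! Verify that every element went through the staging buffers intact.
  if (all(dst == src)) then
    print '(a)', 'staged copy ok'
  else
    error stop 'staged copy mismatch'
  end if
end program staged_copy_sketch
```

The point of the two slots is that the buffer being transferred is never the one being refilled, which is what allows the two steps to overlap in an asynchronous implementation.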

The host pinned memory allocation is relatively expensive, but it occurs only once, when the runtime first encounters a data region, so the cost is amortized over the rest of the run.
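One way to see this one-time cost without a profiler is to time each iteration of the example from the question. The following is a minimal sketch using the standard system_clock intrinsic; the program name and the timing variables (t0, t1, rate) are additions for illustration. Note that the first iteration may also include other one-time costs, such as device initialization, so its time is only an upper bound on the buffer allocation.

```fortran
program time_first_region
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  real, allocatable :: test(:)
  integer :: i, j, n, m
  integer(int64) :: t0, t1, rate   ! timing variables (not in the original code)

  n = 1000
  m = 10
  allocate(test(n))
  call system_clock(count_rate=rate)

  do j = 1, m
    call system_clock(t0)
    !$acc kernels copyout(test)
    !$acc loop independent
    do i = 1, n
      test(i) = real(i)
    end do
    !$acc end kernels
    call system_clock(t1)
    ! Wall-clock time of this region, including the copy back to the host.
    print '(a,i0,a,es10.3,a)', 'iteration ', j, ': ', &
          real(t1 - t0) / real(rate), ' s'
  end do

  deallocate(test)
end program time_first_region
```

If the explanation above holds, the first iteration should take noticeably longer than the remaining nine when the code is built with OpenACC enabled. Without OpenACC the directives are treated as comments and the loop simply runs on the host.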

Note that by removing the copyout clause, you remove the need to transfer the data, and hence the need for the buffers.