
How to properly use pinned memory in ArrayFire?


When using pinned memory in ArrayFire I get slow performance.

I've tried various methods of creating pinned memory and building arrays from it, e.g. cudaMallocHost. Using cudaMallocHost with cudaMemcpy was pretty fast (several hundred usec.), but creating/initializing the ArrayFire array was really slow (~2-3 sec.). Finally I came up with the following method: the allocation takes ~2-3 sec., but it can be moved elsewhere. Initializing the array with the host data is satisfactory (100-200 usec.), but now the operations (FFT in this case) are excruciatingly slow: ~400 msec.

I should add that the input signal is variable in size, but for the timing I've used 64K samples (complex doubles). Also, I'm not providing my timing function for brevity; it isn't the problem. I've timed using other methods and the results are consistent.

// Use the Frequency-Smoothing method to calculate the full 
// Spectral Correlation Density
// currently the whole function takes ~ 2555 msec. w/ signal 64K samples
// and window_length = 400 (currently not implemented)
void exhaustive_fsm(std::vector<std::complex<double>> signal, uint16_t window_length) {

  // Allocate pinned memory (eventually move outside function)
  // 2192 ms.
  af::af_cdouble* device_ptr = af::pinned<af::af_cdouble>(signal.size());

  // Init arrayfire array (eventually move outside function)
  // 188 us.
  af::array s(signal.size(), device_ptr, afDevice);

  // Copy to device
  // 289 us.
  s.write((af::af_cdouble*) signal.data(), signal.size() * sizeof(std::complex<double>), afHost);

  // FFT
  // 351 ms. equivalent to:
  // af::array fft = af::fft(s, signal.size());
  af::array fft = zrp::timeit(&af::fft, s, signal.size());
  fft.eval();

  // Convolution

  // Copy result to host

  // free memory (eventually move outside function)
  // 0 ms.
  af::freePinned((void*) s.device<af::af_cdouble>());

  // Return result
}

As I said above, the FFT is taking ~400 msec. This function using Armadillo takes ~110 msec. including the convolution; the FFT using FFTW takes about 5 msec. Also, on my machine, using the ArrayFire FFT example (modified to use c64) I get the following results:

    A = randu(1, N, c64);

Benchmark 1-by-N CX fft

   1 x  128:                    time:     29 us.
   1 x  256:                    time:     31 us.
   1 x  512:                    time:     33 us.
   1 x 1024:                    time:     41 us.
   1 x 2048:                    time:     53 us.
   1 x 4096:                    time:     75 us.
   1 x 8192:                    time:    109 us.
   1 x 16384:                   time:    179 us.
   1 x 32768:                   time:    328 us.
   1 x 65536:                   time:    626 us.
   1 x 131072:                  time:   1227 us.
   1 x 262144:                  time:   2423 us.
   1 x 524288:                  time:   4813 us.
   1 x 1048576:                 time:   9590 us.

So the only difference I can see is the use of pinned memory. Any idea where I'm going wrong? Thanks.

EDIT

I noticed when running the AF FFT example there is a significant delay before printing out the 1st time (even though the time doesn't include this delay). So I decided to make a class and move all of the allocations/deallocations into the ctor/dtor. Out of curiosity I also put an FFT in the ctor, because I also noticed that if I ran a second FFT it took ~600 usec., consistent with my benchmarks. Sure enough, running a "preliminary" FFT seems to "initialize" something, and subsequent FFTs run much faster. There has to be a better way; I must be missing something.


Solution

  • I am Pradeep, one of the developers of ArrayFire.

    Firstly, all ArrayFire backends (CUDA & OpenCL) have some startup cost, which includes device warmup and/or kernel caching (kernels are cached the first time a particular function is invoked). This is the reason you are noticing better run times after the first run. This is also why we almost always strongly recommend using our built-in timeit function to time ArrayFire code: it averages over a set of runs rather than using the first run.
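    The warm-up-then-time pattern described above can be sketched without ArrayFire itself; time_after_warmup below is a hypothetical stand-in to illustrate the idea (with ArrayFire you would simply use af::timeit):

    ```cpp
    #include <chrono>
    #include <functional>

    // Run the operation once to pay one-time costs (device warmup, kernel
    // caching), then time the subsequent runs and return the average.
    double time_after_warmup(const std::function<void()>& op, int runs = 10) {
        op();  // warm-up run: excluded from the measurement
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < runs; ++i) op();
        auto stop = std::chrono::steady_clock::now();
        // average time per run, in microseconds
        return std::chrono::duration<double, std::micro>(stop - start).count() / runs;
    }
    ```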

    As you already surmised from your experiments, it is always better to keep pinned memory allocations in a controlled way. If you are not already aware of the trade-offs involved in using pinned memory, you can start with this blog post from NVIDIA (it equally applies to pinned memory from the OpenCL backend, with any vendor-specific limitations, of course). The general guideline, as suggested in the hyperlinked post, is as follows:

    You should not over-allocate pinned memory. Doing so can reduce overall system performance because it reduces the amount of physical memory available to the operating system and other programs. How much is too much is difficult to tell in advance, so as with all optimizations, test your applications and the systems they run on for optimal performance parameters.

    If possible, the following is the route I would take to use pinned memory for your FFTs:

    1. Encapsulate pinned allocations/frees in an RAII class, which you are already doing per your edited description.
    2. Do the pinned memory allocation only once if possible - if your data size is static.
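    Point 1 could look like the following minimal sketch. PinnedBuffer is a hypothetical name, and it takes generic allocate/free callables so the sketch stays self-contained; with ArrayFire those callables would wrap af::pinned<T> and af::freePinned.

    ```cpp
    #include <cstddef>
    #include <functional>
    #include <utility>

    // Minimal RAII holder for a page-locked buffer: allocate in the
    // constructor, free in the destructor, so the expensive pinned
    // allocation happens exactly once for the object's lifetime.
    template <typename T>
    class PinnedBuffer {
    public:
        using Alloc = std::function<T*(std::size_t)>;
        using Free  = std::function<void(void*)>;

        PinnedBuffer(std::size_t n, Alloc alloc, Free free_fn)
            : ptr_(alloc(n)), n_(n), free_(std::move(free_fn)) {}
        ~PinnedBuffer() { if (ptr_) free_(ptr_); }

        // page-locked memory should not be copied casually
        PinnedBuffer(const PinnedBuffer&) = delete;
        PinnedBuffer& operator=(const PinnedBuffer&) = delete;

        T* data() { return ptr_; }
        std::size_t size() const { return n_; }

    private:
        T* ptr_;
        std::size_t n_;
        Free free_;
    };
    ```

    With ArrayFire you would construct this once, passing a lambda that calls af::pinned<af::af_cdouble> and a callable wrapping af::freePinned.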

    Apart from these, I think your function is incorrect in a couple of ways. I will go over the function in line order.

    af::af_cdouble* device_ptr = af::pinned<af::af_cdouble>(signal.size());

    This call doesn't allocate memory on the device/GPU. It allocates page-locked memory on the host (RAM).

    af::array s(signal.size(), device_ptr, afDevice);

    Since af::pinned doesn't allocate device memory, this is not a device pointer, and the enum should be afHost. So the call would be af::array s(signal.size(), device_ptr);

    You are using s.write correctly by itself, but I believe it is not needed in your use case.

    The following is what I would do:

    • Use an RAII construct for the pointer returned by af::pinned, and allocate it only once. Be sure you don't have too many of these page-locked allocations.
    • Use the page-locked allocation as your regular host allocation instead of std::vector<std::complex<double>>, because this is host memory, just page-locked. This would involve writing some extra code on your host side if you are operating on a std::vector in some fashion. Otherwise, you can just use the RAII'd pinned pointer to store your data.
    • All you need to do to transfer your FFT data to the device is af::array s(size, ptr).

    At this point, the operations you would have to time are: the transfer from pinned memory to the GPU (the last call in the above list), the FFT execution, and the copy back to host.
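    Putting the advice together, here is a minimal sketch of the suggested flow. SignalBuffer and fill_signal are hypothetical names, and plain new[]/delete[] stand in for af::pinned<af::af_cdouble>/af::freePinned so the sketch stays self-contained; in real code the device transfer would then be af::array s(buf.n, pinned_ptr) followed by af::fft(s).

    ```cpp
    #include <complex>
    #include <cstddef>

    // RAII owner of the host-side signal buffer. In real code the
    // constructor would call af::pinned<af::af_cdouble> and the
    // destructor af::freePinned, making this page-locked memory.
    struct SignalBuffer {
        std::complex<double>* data;
        std::size_t n;
        explicit SignalBuffer(std::size_t count)
            : data(new std::complex<double>[count]), n(count) {}
        ~SignalBuffer() { delete[] data; }
        SignalBuffer(const SignalBuffer&) = delete;
        SignalBuffer& operator=(const SignalBuffer&) = delete;
    };

    // Fill the (page-locked) buffer directly, avoiding the extra copy
    // through a separate std::vector before the device transfer.
    void fill_signal(SignalBuffer& buf) {
        for (std::size_t i = 0; i < buf.n; ++i)
            buf.data[i] = {static_cast<double>(i), 0.0};
    }
    ```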