creating 16-bit input to cufftXtMakePlanMany and workSize for 1 GPU

I need to compute FFT on unsigned int 8bit data. Previously, I was using cufftPlanMany and my input was cufftReal and the output was cufftComplex, and I was using casting before and after FFT to convert from unsigned 8bit to cufftReal and then from cufftComplex to signed 8bit.

It came to my attention that cuFFT has a nice option to run FFT for half-precision data which I hope improves the running time. According to the documentation, it currently doesn't support all of cudaDataType (that would be wonderful if it can in the future), but at least I can run it with 16bit float (half-precision) with the following signature:

cufftResult
cufftXtMakePlanMany(cufftHandle plan, int rank, long long int *n, long long int *inembed,
    long long int istride, long long int idist, cudaDataType inputtype,
    long long int *onembed, long long int ostride, long long int odist,
    cudaDataType outputtype, long long int batch, size_t *workSize,
    cudaDataType executiontype);

with data types for input, output and execution respectively as: CUDA_R_16F, CUDA_C_16F and CUDA_C_16F. Tha twould be ideal that I can feed this cuFFT with my U8 data, is there any way for doing so? Otherwise, if the first casting from U8 to cufftReal is necessary how can I convert my data from cufftReal to CUDA_R_16F and then from CUDA_C_16F ? Is cuda smart enough to cast the input from float to half-precision data, because cufftExecR2C ultimaltey would be the same and there is no other function to be called for the half-precision?

The other question is about workSize which is designed for multiple GPU cases. Any idea how this size has to be calculated? (I have just 1 GPU). Am I responsible for managing that buffer?

Solution

TL;DR: I can see two possible approaches here, one using a half-precision transform and one using a single-precision transform (perhaps with CUFFT callbacks). The reasons to choose one or the other may depend on a number of factors such as size of your transform, control of scope of input data, the GPU you are running on, and other factors.

I'm not going to try to address the processing of the output data that you indicate here:

then from cufftComplex to signed 8bit.

since I don't know how to do that without more information. However the processing of the input data in each case should be illustrative for how you could process the output data.

Using half-precision transforms

A few things to note here are that you cannot (currently) use callbacks with half-precision transforms, and half-precision transforms can be more sensitive to input data characteristics (e.g. DC offset, transform size, etc.) than single or double-precision transforms. Also, half-precision transforms for the most part require a pascal or newer GPU (ignoring Jetson family).

Because half-precision transforms don't support callbacks, we'll use "ordinary" host code to process the input data; you could also do this processing on the device prior to the transform, the provided code outlines both possibilities. My "preprocessing" here is mostly just designed to prevent the 16-bit transform from overflowing. If you play around with this code you'll quickly see what an overflow looks like (inf and/or nan in the output).

$ cat t1961.cu
#include <cufft.h>
#include <stdio.h>
#include <stdlib.h>
#include <cufftXt.h>
#include <cuda_fp16.h>
#include <assert.h>
#include <iostream>

typedef half2 ctype;
typedef half rtype;
typedef unsigned char dtype;
long long sig_size = 1<<18;
const int amplitude = 127;
const float ramplitude = 1/(float)(4*amplitude);

__host__ __device__ half convert(int val){
  return __float2half_rn((val - amplitude)*ramplitude);
}
__global__ void dev_convert(rtype *out, dtype *in, int sz){
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < sz)
    out[idx] = convert(in[idx]);
}

int main(){
  //put 4x sine waves into a U8 array
  dtype *my_data = (dtype *)malloc(sig_size*sizeof(dtype));
  for (int i = 0; i < sig_size; i++) my_data[i] = amplitude*(sin((i*8*3.141592654f)/sig_size)+1.0);
  rtype *d_idata;
  ctype *d_odata;
  cudaMalloc(&d_idata, sizeof(rtype)*sig_size);
#ifdef USE_HOST
  rtype *h_idata = (rtype *)malloc(sig_size*sizeof(rtype));
  // convert to 16 bit float non-offset suitable for cufft
  for (int i = 0; i < sig_size; i++) h_idata[i] = convert(my_data[i]);
  cudaMemcpy(d_idata, h_idata, sig_size*sizeof(rtype), cudaMemcpyHostToDevice);
#else
  const int bs = 256;
  dtype *d_mydata;
  cudaMalloc(&d_mydata, sig_size*sizeof(dtype));
  cudaMemcpy(d_mydata, my_data, sig_size*sizeof(dtype), cudaMemcpyHostToDevice);
  dev_convert<<<(sig_size+bs-1)/bs, bs>>>(d_idata, d_mydata, sig_size);
#endif
  cudaMalloc(&d_odata, sizeof(ctype)*(sig_size/2+1));
  cufftHandle plan;
  cufftResult r;
  r = cufftCreate(&plan);
  assert(r == CUFFT_SUCCESS);
  size_t ws = 0;
  r = cufftXtMakePlanMany(plan, 1,  &sig_size, NULL, 1, 1, CUDA_R_16F, NULL, 1, 1, CUDA_C_16F, 1, &ws, CUDA_C_16F);
  assert(r == CUFFT_SUCCESS);
  r = cufftXtExec(plan, d_idata, d_odata, CUFFT_FORWARD); // warm-up
  assert(r == CUFFT_SUCCESS);
  cudaEvent_t start, stop;
  cudaEventCreate(&start); cudaEventCreate(&stop);
  cudaEventRecord(start);
  r = cufftXtExec(plan, d_idata, d_odata, CUFFT_FORWARD);
  assert(r == CUFFT_SUCCESS);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float et;
  cudaEventElapsedTime(&et, start, stop);
  printf("forward FFT time for %ld samples: %fms\n", sig_size, et);
  ctype *h_odata = (ctype *)malloc((sig_size/2+1)*sizeof(ctype));
  cudaMemcpy(h_odata, d_odata, (sig_size/2+1)*sizeof(ctype), cudaMemcpyDeviceToHost);
  for (int i = 0; i < 8; i++)
    std::cout << __half2float(h_odata[i].x) << " + " << __half2float(h_odata[i].y) << "i" << std::endl;
  return 0;
}
$ nvcc -o t1961 t1961.cu -lcufft
$ ./t1961
forward FFT time for 262144 samples: 0.027520ms
-258 + 0i
0.00349998 + 0.00127506i
-0.000146866 + -0.000833511i
0.00140095 + -0.00501251i
-1.57031 + -32752i
-0.00198174 + 0.00856018i
0.00474548 + 0.00359917i
-0.00226784 + 0.00987244i
$

Using a single precision transform with a load callback

This in my view has a few benefits. It is not as subject to the overflow phenomenon as the half precision transforms are, and the (load) callback routine allows us to still operate on U8 input data.

$ cat t1962.cu
#include <cufft.h>
#include <stdio.h>
#include <stdlib.h>
#include <cufftXt.h>
#include <cuda_fp16.h>
#include <assert.h>
#include <iostream>

typedef cufftComplex ctype;
typedef cufftReal rtype;
typedef unsigned char dtype;
long long sig_size = 1<<18;
const int amplitude = 127;
const cufftReal ramplitude = 1/(float)(4*amplitude);

__device__ rtype convert(int val){
  return (val - amplitude)*ramplitude;
}

 __device__ rtype myOwnCallback(void *dataIn,
                                     size_t offset,
                                     void *callerInfo,
                                     void *sharedPtr) {
     rtype ret;
     ret = convert(((dtype *)dataIn)[offset]);
     return ret;
 }
 __device__ cufftCallbackLoadR myOwnCallbackPtr = myOwnCallback;

int main(){
  cufftCallbackLoadR hostCopyOfCallbackPtr;

  cudaMemcpyFromSymbol(&hostCopyOfCallbackPtr,
                     myOwnCallbackPtr,
                     sizeof(hostCopyOfCallbackPtr));
  //put 4x sine waves into a U8 array
  dtype *my_data = (dtype *)malloc(sig_size*sizeof(dtype));
  for (int i = 0; i < sig_size; i++) my_data[i] = amplitude*(sin((i*8*3.141592654f)/sig_size)+1.0);
  ctype *d_odata;
  dtype *d_mydata;
  cudaMalloc(&d_mydata, sig_size*sizeof(dtype));
  cudaMemcpy(d_mydata, my_data, sig_size*sizeof(dtype), cudaMemcpyHostToDevice);
  cudaMalloc(&d_odata, sizeof(ctype)*(sig_size/2+1));
  cufftHandle plan;
  cufftResult r;
  r = cufftCreate(&plan);
  assert(r == CUFFT_SUCCESS);
  size_t ws = 0;
  r = cufftXtMakePlanMany(plan, 1,  &sig_size, NULL, 1, 1, CUDA_R_32F, NULL, 1, 1, CUDA_C_32F, 1, &ws, CUDA_C_32F);
  assert(r == CUFFT_SUCCESS);
  void *rps[] = {(void *)hostCopyOfCallbackPtr};
  r = cufftXtSetCallback(plan, rps, CUFFT_CB_LD_REAL, NULL);
  assert(r == CUFFT_SUCCESS);
  r = cufftXtExec(plan, (cufftReal *)d_mydata, d_odata, CUFFT_FORWARD); // warm-up
  assert(r == CUFFT_SUCCESS);
  cudaEvent_t start, stop;
  cudaEventCreate(&start); cudaEventCreate(&stop);
  cudaEventRecord(start);
  r = cufftXtExec(plan, (cufftReal *)d_mydata, d_odata, CUFFT_FORWARD);
  assert(r == CUFFT_SUCCESS);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float et;
  cudaEventElapsedTime(&et, start, stop);
  printf("forward FFT time for %ld samples: %fms\n", sig_size, et);
  ctype *h_odata = (ctype *)malloc((sig_size/2+1)*sizeof(ctype));
  cudaMemcpy(h_odata, d_odata, (sig_size/2+1)*sizeof(ctype), cudaMemcpyDeviceToHost);
  for (int i = 0; i < 8; i++)
    std::cout << h_odata[i].x << " + " << h_odata[i].y << "i" << std::endl;
  return 0;
}
$ nvcc -o t1962 t1962.cu -rdc=true -lcufft_static -lculibos
$ ./t1962
forward FFT time for 262144 samples: 0.031488ms
-257.969 + 0i
0.00344251 + 0.00137726i
-3.96543e-05 + -0.00106905i
0.0013994 + -0.00490045i
0.0331312 + -32759.4i
-0.00190887 + 0.00865401i
0.00454092 + 0.00368094i
-0.00219025 + 0.00983646i
$

Yes, the results are not numerically identical between the two transform types. It's not reasonable to expect that 16-bit floating point calculations and 32-bit floating point calculations will be identical. In all probability the 32-bit calculations are "more accurate". For this sinewave case, the terms I consider most important are the DC term as well as the magnitude spike at the fundamental. Those are numerically close to each other. The other terms are "in the noise". The timing results are not exactly comparable either, as the 16-bit calculation case omits the cost of the kernel call to convert the data from U8 to F16. You can use a profiler or just refactor the code to get more comparable timing.

workSize can be ignored for the single GPU case when using cufftXtMakePlanMany, otherwise, use the provided routines to determine workSize.