
CUDA: How does Thrust manage memory when using a Comparator in a sorting function?


I have a char array of 10 characters that I would like to pass as an argument to a comparator which will be used by Thrust's sorting function.

In order to allocate memory for this array I use cudaMalloc. However, cudaMalloc allocates memory in global memory, so whenever a thread wants to read data from this array it has to access global memory.

But this array is small, and I believe it would be more efficient if it were stored in shared memory or even in the registers of each thread. Is it possible to achieve this with Thrust, and if so, how?

Here is the comparator:

struct comp {
    int *data_to_sort;
    char *helpingArray;

    comp(int *data_ptr) { data_to_sort = data_ptr; }

    __host__ __device__
    bool operator()(const int &a, const int &b) {
        // use helpingArray to do some comparisons and
        // return true/false accordingly
    }
};

Then I allocate memory for helpingArray in global memory and pass it, together with the comparator struct, to the sorting function.

Note that the data_to_sort array is stored in global memory because it contains the data that needs to be sorted; that can't be avoided.

This works fine, and the sort is faster than the CPU sorting method. However, I believe the sort would be faster still if helpingArray were not stored in global memory.
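For concreteness, the setup described above looks roughly like this. This is a sketch, not the original poster's code: the index-sorting pattern, sizes, and fill values are all assumptions for illustration.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>
#include <cuda_runtime.h>

int main()
{
    // sort indices into the data using the comparator from the question
    thrust::device_vector<int> data(1000);
    thrust::device_vector<int> indices(1000);
    thrust::sequence(indices.begin(), indices.end());

    // helpingArray lives in global memory: every device-side read
    // inside operator() goes out to global memory
    char h_helping[10] = {'a','b','c','d','e','f','g','h','i','j'};
    char *d_helping;
    cudaMalloc(&d_helping, sizeof(h_helping));
    cudaMemcpy(d_helping, h_helping, sizeof(h_helping), cudaMemcpyHostToDevice);

    comp c(thrust::raw_pointer_cast(data.data()));
    c.helpingArray = d_helping;

    thrust::sort(indices.begin(), indices.end(), c);

    cudaFree(d_helping);
    return 0;
}
```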


Solution

  • I would agree that putting helpingArray into global memory makes little sense and is reducing performance to at least some degree. The thrust back-end which executes kernels is "closed" and doesn't expose kernel level features like shared memory or registers, so those can't be directly used.

    Having said that, there are probably two things you can do to improve this. The first would be to declare your functor like this:

    struct comp {
        char helpingArray[10];

        __host__ __device__
        bool operator()(const int &a, const int &b) { ... }
    };
    

    You can populate helpingArray in host code before passing the functor to the Thrust algorithm you are using (note that the functor is passed by value, so this is perfectly legal). In that case, helpingArray probably winds up in thread-local memory. There may or may not be a performance improvement in doing so. It certainly greatly simplifies the host code required to support things.
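    A minimal sketch of that approach. The comparison logic here (indexing the helper table by the values being sorted) is purely illustrative, since the original post elides it:

    ```cuda
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <cstring>

    struct comp {
        char helpingArray[10];   // embedded in the functor, copied by value

        explicit comp(const char *help) {
            // host-side population before the functor is handed to Thrust
            std::memcpy(helpingArray, help, sizeof(helpingArray));
        }

        __host__ __device__
        bool operator()(const int &a, const int &b) const {
            // illustrative: compare via the helper table
            return helpingArray[a % 10] < helpingArray[b % 10];
        }
    };

    int main()
    {
        char h_helping[10] = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3};
        thrust::device_vector<int> keys(100);
        // the functor (and its embedded 10-byte array) is copied by value
        thrust::sort(keys.begin(), keys.end(), comp(h_helping));
        return 0;
    }
    ```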

    The other alternative is to declare helpingArray in __constant__ memory and just refer to it inside the functor. If the access pattern of each thread is uniform, then there could be a performance improvement in doing so because of the constant cache.
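    A sketch of the `__constant__` alternative (the symbol name and comparison logic are illustrative). Note that `__constant__` variables are not accessible from host code, so the host path of the functor needs its own branch:

    ```cuda
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <cuda_runtime.h>

    __constant__ char c_helpingArray[10];   // served through the constant cache

    struct comp {
        __host__ __device__
        bool operator()(const int &a, const int &b) const {
    #ifdef __CUDA_ARCH__
            // device path: uniform accesses are broadcast from the constant cache
            return c_helpingArray[a % 10] < c_helpingArray[b % 10];
    #else
            return a < b;   // host fallback; not used by the device sort
    #endif
        }
    };

    int main()
    {
        char h_helping[10] = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3};
        // copy the helper table into constant memory once, up front
        cudaMemcpyToSymbol(c_helpingArray, h_helping, sizeof(h_helping));

        thrust::device_vector<int> keys(100);
        thrust::sort(keys.begin(), keys.end(), comp());
        return 0;
    }
    ```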