Tags: c, cuda, bit-manipulation, gpgpu, gpu-shared-memory

How to efficiently perform loads and bitwise operations using GPGPU?


I need to load an array of 128 to 256 bytes in length into GPU shared memory.

I want to minimize global memory access while performing bit-wise operations efficiently.

I have a 256-byte array in global memory, and I want to load all 256 bytes into shared memory at the beginning of the kernel code.

Then a bit-wise operation has to be performed on each byte; this operation can be applied to two 1-byte variables, two 2-byte variables, two 4-byte variables, or larger.

The questions are:

  1. If I load 16 bytes of the array per thread, do I get just one memory access for (warp size) × (16 bytes) of data, or is 4 bytes per thread the better choice?
  2. If I load 16 bytes per thread into shared memory, should I then OR those 16 bytes with another 16-byte variable to set a bit to 1?
  3. If I load 16 bytes per thread and then perform an OR on a 16-byte data type, would a 4-byte or smaller data type be faster?

For example, I want to set the 3rd bit to 1:

__shared__ (which data type?) temp = (((which data type?) *)array)[i];
temp |= (a variable with the 3rd bit set to 1);

array is in global memory, and I want to load it into shared memory with the minimum number of global memory accesses.
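
Below is a minimal sketch of the 4-bytes-per-thread approach the questions above ask about. The kernel name, array name, mask, and launch configuration are illustrative assumptions, not part of the original question:

#include <cuda_runtime.h>

#define N_BYTES 256
#define N_WORDS (N_BYTES / 4)  /* 64 4-byte words */

/* Each of 64 threads loads one 4-byte word; each warp's 32 loads
   coalesce into a single 128-byte global memory transaction. */
__global__ void set_bit_kernel(unsigned int *d_array)
{
    __shared__ unsigned int s_data[N_WORDS];

    int i = threadIdx.x;
    if (i < N_WORDS) {
        s_data[i] = d_array[i];    /* one 4-byte global load per thread */
        s_data[i] |= 0x04040404u;  /* set bit 2 (the "3rd bit") of every byte */
        /* ... any further bit-wise work on s_data would go here ... */
        d_array[i] = s_data[i];    /* copy the result back to global memory */
    }
}

Launched as set_bit_kernel<<<1, 64>>>(d_array). For the 16-bytes-per-thread variant, declare s_data as uint4 and load with ((const uint4 *)d_array)[i]; a warp then requests 512 bytes, which the hardware services as four 128-byte transactions, so both layouts coalesce. Profiling on the target device is the only reliable way to choose between them.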


Solution

  • Match the size of the memory transaction to the bank size. Most of the time, the bank size is 32 bits. Starting with Kepler, you can select a 64-bit bank size by calling cudaDeviceSetSharedMemConfig() (a host-side sketch follows the link below).

    Acceleware has a good resource on the topic:

    http://acceleware.com/blog/maximizing-shared-memory-bandwidth-nvidia-kepler-gpus
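
    A minimal host-side sketch of that call, assuming a Kepler-class device (the error handling is illustrative):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        /* Request 64-bit (8-byte) shared memory banks;
           cudaSharedMemBankSizeFourByte restores the default 32-bit banks. */
        cudaError_t err = cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
        if (err != cudaSuccess)
            printf("cudaDeviceSetSharedMemConfig failed: %s\n",
                   cudaGetErrorString(err));
        return 0;
    }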