Tags: c, cuda, bit-manipulation, gpgpu, gpu-shared-memory

How to efficiently perform loads and bitwise operations using GPGPU?


I need to load an array of 128 to 256 bytes in length into GPU shared memory.

I want to minimize global memory access while performing bit-wise operations efficiently.

I have a 256-byte array in global memory, and I want to load all 256 bytes into shared memory at the beginning of the kernel code.

Then a bit-wise operation has to be performed on each byte; this operation can be applied to two 1-byte variables, two 2-byte variables, two 4-byte variables, or larger.

The questions are:

  1. If I load 16 bytes of the array per thread, do I get just one memory access for (warp size) × (16 bytes) of data, or is 4 bytes per thread the better choice?
  2. If I load 16 bytes per thread into shared memory, should I then OR those 16 bytes with another 16-byte variable to set a bit to 1?
  3. If I load 16 bytes per thread and then perform an OR on a 16-byte data type, would a 4-byte or smaller data type be faster?

For example, I want to set the 3rd bit to 1:

__shared__ (which data type?) temp = (((which data type?) *)array)[i];
temp |= (a variable with the 3rd bit set to 1);

array is in global memory, and I want to load it into shared memory with the minimum number of global memory accesses.
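
Below is a minimal sketch of the 4-bytes-per-thread approach the questions above ask about. The kernel name, array name, mask, and launch configuration are illustrative assumptions, not part of the original question:

#include <cuda_runtime.h>

#define N_BYTES 256
#define N_WORDS (N_BYTES / 4)  /* 64 4-byte words */

/* Each of 64 threads loads one 4-byte word; each warp's 32 loads
   coalesce into a single 128-byte global memory transaction. */
__global__ void set_bit_kernel(unsigned int *d_array)
{
    __shared__ unsigned int s_data[N_WORDS];

    int i = threadIdx.x;
    if (i < N_WORDS) {
        s_data[i] = d_array[i];    /* one 4-byte global load per thread */
        s_data[i] |= 0x04040404u;  /* set bit 2 (the "3rd bit") of every byte */
        /* ... any further bit-wise work on s_data would go here ... */
        d_array[i] = s_data[i];    /* copy the result back to global memory */
    }
}

Launched as set_bit_kernel<<<1, 64>>>(d_array). For the 16-bytes-per-thread variant, declare s_data as uint4 and load with ((const uint4 *)d_array)[i]; a warp then requests 512 bytes, which the hardware services as four 128-byte transactions, so both layouts coalesce. Profiling on the target device is the only reliable way to choose between them.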


Solution

  • Match the size of the memory transaction to the bank size. Most of the time, the bank size is 32 bits. Starting with Kepler, you can select a 64-bit bank size by calling cudaDeviceSetSharedMemConfig() (a host-side sketch follows the link below).

    Acceleware has a good resource on the topic:

    http://acceleware.com/blog/maximizing-shared-memory-bandwidth-nvidia-kepler-gpus
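
    A minimal host-side sketch of that call, assuming a Kepler-class device (the error handling is illustrative):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        /* Request 64-bit (8-byte) shared memory banks;
           cudaSharedMemBankSizeFourByte restores the default 32-bit banks. */
        cudaError_t err = cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
        if (err != cudaSuccess)
            printf("cudaDeviceSetSharedMemConfig failed: %s\n",
                   cudaGetErrorString(err));
        return 0;
    }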