I need to load an array of 128 to 256 bytes into GPU shared memory.
I want to minimize global memory access while performing bit-wise operations efficiently.
I have a 256-byte array in global memory, and I want to load all 256 bytes into shared memory at the beginning of the kernel.
Then a bit-wise operation has to be performed on each byte. The operation can be applied to two 1-byte variables, two 2-byte variables, two 4-byte variables, or larger.
The questions are:
For example, I want to set the 3rd bit to 1:
__shared__ (which data type?) temp = ((which data type?) *)array[i];
temp |= (a variable with third bit set to 1)
The array is in global memory, and I want to load it into shared memory with the minimum number of global memory accesses.
Match the size of each memory access to the shared memory bank size. Most of the time, the bank size is 32 bits. Kepler introduced a configurable bank size: there you can request 64-bit banks by calling cudaDeviceSetSharedMemConfig() with cudaSharedMemBankSizeEightByte.
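To make that concrete for the 256-byte case, here is a minimal sketch, not a definitive implementation. It assumes the array is 4-byte aligned (which holds for pointers returned by cudaMalloc), that "3rd bit" means mask 0x04 in every byte, and that one block of 64 threads is launched. The names set_third_bit, tile, kBytes, kWords and count_mismatches are made up for illustration. Each thread moves one 32-bit word, so every warp reads 128 contiguous bytes from global memory and each thread of a warp lands in a different 32-bit bank.

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

constexpr int kBytes = 256;         // array size from the question
constexpr int kWords = kBytes / 4;  // 64 words of 32 bits

// Hypothetical kernel: `in` and `out` are assumed to be 4-byte aligned.
__global__ void set_third_bit(const uint32_t* in, uint32_t* out)
{
    __shared__ uint32_t tile[kWords];

    // One 32-bit load per thread; consecutive threads read consecutive
    // words, so the global loads of a warp are contiguous.
    if (threadIdx.x < kWords)
        tile[threadIdx.x] = in[threadIdx.x];

    // Required in general, whenever a thread later reads a word that a
    // different thread loaded.
    __syncthreads();

    if (threadIdx.x < kWords) {
        // 0x04 is the third bit of a byte (bit 2 counting from zero);
        // 0x04040404u sets it in all four bytes of the word at once.
        tile[threadIdx.x] |= 0x04040404u;
        out[threadIdx.x] = tile[threadIdx.x];
    }
}

// Host-side check: runs the kernel once and counts bytes that differ
// from the byte-wise reference (in[i] | 0x04).
int count_mismatches()
{
    unsigned char h_in[kBytes], h_out[kBytes];
    for (int i = 0; i < kBytes; ++i)
        h_in[i] = static_cast<unsigned char>(i);

    uint32_t* d_in = nullptr;
    uint32_t* d_out = nullptr;
    cudaMalloc(&d_in, kBytes);
    cudaMalloc(&d_out, kBytes);
    cudaMemcpy(d_in, h_in, kBytes, cudaMemcpyHostToDevice);

    set_third_bit<<<1, kWords>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, kBytes, cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < kBytes; ++i)
        if (h_out[i] != static_cast<unsigned char>(h_in[i] | 0x04))
            ++mismatches;

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", count_mismatches());
    return 0;
}
```

The same pattern works with uint2 or uint4 elements if you want 8- or 16-byte loads per thread, as long as the pointer is aligned to that size. For a bit-wise OR the word width does not change the result, because each bit is handled independently.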
Acceleware has a good resource on the topic: