
how to efficiently load a big array to GPU shared memory?


I want to load a big array into GPU shared memory. When I do it like below:

    int index = threadIdx.x;
    __shared__ unsigned char x[1000];
    x[index] = array[index];
    

Then, if we launch the kernel with 1000 threads and one block, does a separate memory access occur for every thread?

Is it possible to load this array into shared memory with a single memory access?

Any suggestion would be greatly appreciated.


Solution

  • No, it can't be done with a single access.

    Using threads in parallel to load shared memory, just as you have shown, is the fastest way. Shared memory can only be loaded by memory operations performed by threads in CUDA kernels. There are no API functions to load shared memory.

    If you have an array that is larger than the number of threads in a threadblock, you can use a looping approach like the one outlined here.
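    A minimal sketch of that looping approach, assuming an illustrative array size `N` of 5000 bytes and a hypothetical kernel name: each thread starts at its own index and strides by the block size until the whole array is covered, then all threads synchronize before using the shared copy.

    ```cuda
    #include <cuda_runtime.h>

    #define N 5000  // illustrative size, larger than one block of threads

    __global__ void load_kernel(const unsigned char *array)
    {
        __shared__ unsigned char x[N];

        // Each thread loads elements threadIdx.x, threadIdx.x + blockDim.x, ...
        // so the block cooperatively covers all N elements.
        for (int i = threadIdx.x; i < N; i += blockDim.x)
            x[i] = array[i];

        __syncthreads();  // wait until the entire array is in shared memory

        // ... use x[] here ...
    }
    ```

    With a 1000-thread block, each thread performs five loads here; the per-thread loads within a warp still coalesce into wide transactions, which is why this pattern is fast.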