
In CUDA, what instruction is used to load data from global memory to shared memory?


I am currently studying CUDA and learned that GPUs have both global memory and shared memory.

I have checked the CUDA documentation and found that GPUs access shared memory and global memory using the ld.shared/st.shared and ld.global/st.global instructions, respectively.

What I am curious about is: what instruction is used to load data from global memory into shared memory?

It would be great if someone could let me know.

Thanks!

__global__ void my_function(int* global_mem)
{
    __shared__ int shared_mem[10];
    for(int i = 0; i < 10; i++) {
        shared_mem[i] = global_mem[i];  // What instruction is used for this load operation?
    }
}

Solution

  • In the case of

    __shared__ float smem[2];
    smem[0] = global_memory[0];
    

    the operation is, in SASS, the pair below (see the sketch after the summary table for a way to verify this yourself):

    LDG  Rx, [Ry]
    STS  [Rz], Rx
    

    To expand a bit more, read https://forums.developer.nvidia.com/t/whats-different-between-ld-and-ldg-load-from-generic-memory-vs-load-from-global-memory/40856/2

    Summary:

    Instruction   Meaning
    -----------   -------------------------------------------------------
    LDS           load from shared space
    LDC           load from constant space
    LDG           load from global space
    LD            generic load (space deduced from the supplied address)
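
    One quick way to confirm the LDG/STS pair shown above is to compile a small kernel and disassemble it. A minimal sketch, assuming nvcc and cuobjdump are available (the kernel and file names are placeholders, and -arch should match your GPU):

    // copy_kernel.cu - the global->shared copy below compiles to an LDG/STS pair
    __global__ void copy_kernel(const int* global_mem, int* out)
    {
        __shared__ int shared_mem[10];
        int i = threadIdx.x;
        if (i < 10) {
            shared_mem[i] = global_mem[i];  // LDG Rx, [Ry] then STS [Rz], Rx
        }
        __syncthreads();
        if (i < 10) {
            out[i] = shared_mem[i];         // LDS, then STG on the way back out
        }
    }
    // Build a cubin and dump its SASS, then look for LDG and STS in the listing:
    //   nvcc -arch=sm_70 -cubin -o copy_kernel.cubin copy_kernel.cu
    //   cuobjdump -sass copy_kernel.cubin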
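
    Similarly, to make the LD row of the table concrete: the compiler falls back to a generic load when it cannot prove which space a pointer refers to. A minimal sketch (the function name is made up; __noinline__ blocks inlining, which would otherwise let the compiler recover the address space and emit LDG):

    // Inside this non-inlined function the compiler cannot tell whether p
    // points to global, shared, or local memory, so the dereference is
    // typically compiled to a generic LD that the hardware resolves at
    // run time from the address itself.
    __device__ __noinline__ int load_via_generic(const int* p)
    {
        return *p;
    }

    __global__ void demo(const int* g, int* out)
    {
        __shared__ int s[32];
        s[threadIdx.x] = load_via_generic(g + threadIdx.x);  // generic LD, then STS
        __syncthreads();
        out[threadIdx.x] = s[threadIdx.x];
    }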