Passing a constant vector to shared memory

I have a vector of length 128; all of its elements stay constant throughout the computation.

I would like to use this constant vector in my CUDA kernel. I am thinking of storing it in shared memory and using it from the kernel. How do I do that? A few lines of code would be nice.

Or is this the best way to do it? Thanks a lot.

Off the top of my head, I could pass it through global memory:

__global__ void fun(float* a, float* coeff)
{
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;

  if (i >= 128)
    return;

  a[i] *= coeff[i];
}

but that is probably not the best way. I imagine something like

__shared__ float coeff[128];

But how do I copy the CPU values into this shared memory? And how do I pass this shared memory to my kernel?

Solution

__shared__ memory can't be directly accessed from host code. So you must pass data to it via global memory first, and then from there copy it (using kernel code) into __shared__ space.

A simple modification to your kernel code to demonstrate the concept would look like this:

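__global__ void fun(float* a, float* coeff)
{
  __shared__ float scoeff[128];
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;

  // Guard the accesses instead of returning early, so that every
  // thread in the block still reaches the __syncthreads() barrier.
  if (i < 128)
    scoeff[i] = coeff[i];   // global -> shared
  __syncthreads();

  if (i < 128)
    a[i] *= scoeff[i];      // use the shared copy
}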
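
For completeness, here is a minimal host-side sketch of the full round trip: the coefficients go from the host to global memory with cudaMemcpy, and the kernel then stages them into shared memory. The helper name runSharedDemo, the constant N, the test values, and the single-block launch of 128 threads are assumptions made for illustration; they are not part of the original question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 128;  // vector length from the question

// Same idea as the kernel above: stage coeff into shared memory, then use it.
__global__ void fun(float* a, const float* coeff)
{
  __shared__ float scoeff[N];
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;

  if (i < N)
    scoeff[i] = coeff[i];
  __syncthreads();  // reached by every thread in the block

  if (i < N)
    a[i] *= scoeff[i];
}

// Hypothetical helper: returns the number of mismatching elements,
// or -1 if a CUDA call fails.
int runSharedDemo()
{
  float h_a[N], h_coeff[N];
  for (int i = 0; i < N; ++i) {
    h_a[i] = 1.0f;
    h_coeff[i] = 0.5f * i;
  }

  float *d_a = nullptr, *d_coeff = nullptr;
  if (cudaMalloc(&d_a, sizeof(h_a)) != cudaSuccess) return -1;
  if (cudaMalloc(&d_coeff, sizeof(h_coeff)) != cudaSuccess) {
    cudaFree(d_a);
    return -1;
  }

  // Host -> global memory. Shared memory cannot be written from the host.
  cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);
  cudaMemcpy(d_coeff, h_coeff, sizeof(h_coeff), cudaMemcpyHostToDevice);

  fun<<<1, N>>>(d_a, d_coeff);  // one block of 128 threads (assumed)
  cudaError_t launchErr = cudaGetLastError();
  cudaError_t copyErr = cudaMemcpy(h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost);

  cudaFree(d_a);
  cudaFree(d_coeff);
  if (launchErr != cudaSuccess || copyErr != cudaSuccess) return -1;

  int bad = 0;
  for (int i = 0; i < N; ++i)
    if (h_a[i] != 0.5f * i) ++bad;  // 1.0f * coeff[i] is exact here
  return bad;
}

int main()
{
  int bad = runSharedDemo();
  printf("mismatches: %d\n", bad);
  return bad == 0 ? 0 : 1;
}
```

This should build with something like nvcc demo.cu -o demo, where the file name is arbitrary.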

Notes:

There are numerous CUDA sample codes that demonstrate more advanced usage of shared memory, e.g. 6_Performance/transpose.
The usage here would provide no benefit: each thread reads one coefficient from global memory and uses it exactly once. Shared memory is normally used in situations that require inter-thread communication, or where there is data reuse. Your code demonstrates neither.
There are a number of other ways to provide constant values to a kernel, including arrays of constants, such as __constant__ memory. Whether or not any of these would be beneficial would depend very much on your actual use case and access patterns, which I assume are not represented by the code you have shown. In any event there are quite a few questions here on the CUDA tag that discuss various kinds of constant data usage, which I'm sure you can find with a bit of searching.
The __syncthreads() is arguably not necessary for this code, because each thread reads back only the element it wrote itself. It is necessary in a great many more typical uses of shared memory, though, so I've chosen to point it out here. Then again, this particular code would not be a sensible usage of shared memory in the first place.
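
As an illustration of the __constant__ option mentioned in the notes, a minimal sketch might look like the following. The names c_coeff, scaleConst and runConstantDemo, the test values, and the launch configuration are assumptions for illustration only. Whether this helps depends on the access pattern: the constant cache is at its best when all threads of a warp read the same element, which is not the case in this per-element scaling, so treat it as a demonstration of the mechanics rather than a recommendation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 128;  // vector length from the question

// Lives in constant memory: written from the host, read-only in kernels.
__constant__ float c_coeff[N];

// Hypothetical kernel: no coeff pointer argument is needed,
// because the constant array is visible at file scope.
__global__ void scaleConst(float* a)
{
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N)
    a[i] *= c_coeff[i];
}

// Hypothetical helper: returns the number of mismatching elements,
// or -1 if a CUDA call fails.
int runConstantDemo()
{
  float h_a[N], h_coeff[N];
  for (int i = 0; i < N; ++i) {
    h_a[i] = 2.0f;
    h_coeff[i] = 0.25f * i;
  }

  // Host -> constant memory; the copy direction defaults to host-to-device.
  if (cudaMemcpyToSymbol(c_coeff, h_coeff, sizeof(h_coeff)) != cudaSuccess)
    return -1;

  float* d_a = nullptr;
  if (cudaMalloc(&d_a, sizeof(h_a)) != cudaSuccess) return -1;
  cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);

  scaleConst<<<1, N>>>(d_a);  // one block of 128 threads (assumed)
  cudaError_t launchErr = cudaGetLastError();
  cudaError_t copyErr = cudaMemcpy(h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost);

  cudaFree(d_a);
  if (launchErr != cudaSuccess || copyErr != cudaSuccess) return -1;

  int bad = 0;
  for (int i = 0; i < N; ++i)
    if (h_a[i] != 0.5f * i) ++bad;  // 2.0f * 0.25f * i is exact here
  return bad;
}

int main()
{
  int bad = runConstantDemo();
  printf("mismatches: %d\n", bad);
  return bad == 0 ? 0 : 1;
}
```

For a plain element-wise scaling like the one in the question, passing the coefficients through global memory, as in your original kernel, is likely just as good.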