Tags: cuda, hpc, prefix-sum

prefix sum using CUDA


I am having trouble understanding the CUDA code for a naive prefix sum.

This code is from https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html. Example 39-1 (naive scan) looks like this:

    __global__ void scan(float *g_odata, float *g_idata, int n)
    {
        extern __shared__ float temp[]; // allocated on invocation
        int thid = threadIdx.x;
        int pout = 0, pin = 1;

        // Load input into shared memory.
        // This is exclusive scan, so shift right by one
        // and set first element to 0
        temp[pout*n + thid] = (thid > 0) ? g_idata[thid - 1] : 0;
        __syncthreads();

        for (int offset = 1; offset < n; offset *= 2)
        {
            pout = 1 - pout; // swap double buffer indices
            pin = 1 - pout;
            if (thid >= offset)
                temp[pout*n + thid] += temp[pin*n + thid - offset];
            else
                temp[pout*n + thid] = temp[pin*n + thid];
            __syncthreads();
        }
        g_odata[thid] = temp[pout*n + thid]; // write output
    }

My questions are:

  1. Why do we need to create the shared-memory array temp?
  2. Why do we need the "pout" and "pin" variables? What do they do? Since we only use one block and at most 1024 threads here, can't we just use threadIdx.x to index the elements in the block?
  3. In CUDA, does one thread do one add operation? Is it like one thread doing what a single iteration of a for loop would do (as in OpenMP, where threads are mapped over the elements of an array, one thread per element)?
  4. My previous two questions may seem naive... I think the key is that I don't understand the relation between the above implementation and the following pseudocode:

    for d = 1 to log2 n do
        for all k in parallel do
            if k >= 2^d then
                x[k] = x[k - 2^(d-1)] + x[k]

This is my first time using CUDA, so I'd appreciate it if anyone could answer my questions...


Solution

  • 1- It's faster to stage the data in shared memory and do the calculations there than to work out of global memory: shared memory is on-chip and has much lower latency, and each element is read several times over the course of the scan. It's important to synchronize the threads after loading shared memory, hence the __syncthreads().
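    As a side note, the extern __shared__ array is sized at launch time by the third kernel launch parameter, and since the kernel ping-pongs between two n-element buffers it needs room for 2*n floats. Here is a minimal host-side sketch (my own illustration, with hypothetical buffer names d_idata and d_odata, assuming n is within the block-size limit):

        int n = 1024;                          // one block, one thread per element
        size_t bytes = n * sizeof(float);
        float *d_idata, *d_odata;
        cudaMalloc(&d_idata, bytes);
        cudaMalloc(&d_odata, bytes);
        // ... copy the input into d_idata with cudaMemcpy ...
        // third launch parameter: dynamic shared memory for the double buffer
        scan<<<1, n, 2 * n * sizeof(float)>>>(d_odata, d_idata, n);
        cudaDeviceSynchronize();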


    2- "pout" and "pin" implement double buffering: temp actually holds two n-element buffers, and each iteration reads from one half (pin) while writing to the other (pout). This avoids a read-after-write hazard, since another thread may still need the old value of an element you are about to overwrite:

    temp[pout*n+thid] += temp[pin*n+thid - offset];
    

    In the first iteration, pout = 1 and pin = 0; in the second, pout = 0 and pin = 1; and so on. The pout*n term offsets the writes into the second half of temp on odd iterations and into the first half on even iterations, while pin*n does the opposite for the reads. To come back to your question: you can't achieve the same thing with threadIdx.x alone, because it doesn't change across the iterations of the loop.
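    A worked trace may make the ping-pong concrete. This is my own hand trace (not from the chapter), using the sample input [3, 1, 7, 0, 4, 1, 6, 3] with n = 8:

        buffer 0 after load:      [0, 3, 1,  7,  0,  4,  1,  6]   // shifted right, first element 0
        offset = 1 -> buffer 1:   [0, 3, 4,  8,  7,  4,  5,  7]
        offset = 2 -> buffer 0:   [0, 3, 4, 11, 11, 12, 12, 11]
        offset = 4 -> buffer 1:   [0, 3, 4, 11, 11, 15, 16, 22]   // final exclusive scan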


    3 & 4 - CUDA launches many threads that all run the kernel, meaning each thread executes that code separately. Comparing the pseudocode with the CUDA code: the inner "for all k in parallel" loop is what the threads implement, one value of k (thid) per thread, while the outer loop over d survives as the for loop over offset inside the kernel (offset = 2^(d-1)). So each thread runs the offset loop to the end, synchronizing with the other threads of the block at every iteration, and only then writes its element to global memory.
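    To make that mapping explicit, here is a sequential CPU sketch of the same double-buffered algorithm (my own illustration, not code from the chapter). The inner loops over k are exactly what the CUDA threads execute in parallel, one k per thread:

        // Sequential reference for the naive double-buffered scan (n <= 1024).
        // In the kernel, the inner k-loops disappear: each thread is one fixed k.
        void scan_cpu(float *out, const float *in, int n)
        {
            float temp[2][1024];                    // the two halves of the shared array
            int pout = 0, pin = 1;
            for (int k = 0; k < n; k++)             // = the parallel load
                temp[pout][k] = (k > 0) ? in[k - 1] : 0;
            for (int offset = 1; offset < n; offset *= 2)
            {
                pout = 1 - pout;                    // swap double buffer indices
                pin  = 1 - pout;
                for (int k = 0; k < n; k++)         // = "for all k in parallel"
                    temp[pout][k] = (k >= offset)
                        ? temp[pin][k] + temp[pin][k - offset]
                        : temp[pin][k];
                // the end of this k-loop plays the role of __syncthreads()
            }
            for (int k = 0; k < n; k++)             // = the parallel write-out
                out[k] = temp[pout][k];
        }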


    Hope this helps.