I have a Thrust vector for each thread in CUDA, and I want to stack the vectors in order (the vector from thread 0, then thread 1, ... then thread n) to create one 1D vector and send it back to the CPU. Is there a good way to do this? Any help is appreciated. Thank you.
The most performant way to store items from several threads into a single vector is thread-interleaved. Suppose each of 4 threads (t0-t3) has 4 elements to store (e0-e3). The most efficient final storage pattern is:
t0e0 t1e0 t2e0 t3e0 t0e1 t1e1 t2e1 t3e1 t0e2 t1e2 t2e2 t3e2 t0e3 t1e3 t2e3 t3e3
The code to do that would look like this:
#include <thrust/device_vector.h>

const int nt = 4;  // number of threads
const int ne = 4;  // number of elements per thread

template <typename T>
__global__ void k(T *d){
    // each thread produces ne values (placeholder data derived from the thread index)
    T e0 = threadIdx.x+10;
    T e1 = threadIdx.x+20;
    T e2 = threadIdx.x+30;
    T e3 = threadIdx.x+40;
    // interleaved (coalesced) storage: element i from every thread lands in one contiguous group
    d[threadIdx.x]      = e0;
    d[threadIdx.x+nt]   = e1;
    d[threadIdx.x+2*nt] = e2;
    d[threadIdx.x+3*nt] = e3;
}

int main(){
    thrust::device_vector<int> d(ne*nt);
    k<<<1,nt>>>(thrust::raw_pointer_cast(d.data()));
}
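To get the data back to the CPU (the last part of your question), the device_vector can simply be copied into a thrust::host_vector. A minimal sketch, reusing the kernel k and the constants nt and ne from the example above:

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <cstdio>

// assumes the kernel k and the constants nt, ne defined in the example above
int main(){
    thrust::device_vector<int> d(ne*nt);
    k<<<1,nt>>>(thrust::raw_pointer_cast(d.data()));
    // constructing a host_vector from a device_vector performs the device-to-host copy
    thrust::host_vector<int> h = d;
    for (int i = 0; i < ne*nt; i++) printf("%d ", h[i]);
    printf("\n");
}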
In your question it seems that you want this order:
t0e0 t0e1 t0e2 t0e3 t1e0 t1e1 t1e2 t1e3 t2e0 t2e1 t2e2 t2e3 t3e0 t3e1 t3e2 t3e3
That storage pattern will generally be less efficient, but you could achieve it like this:
#include <thrust/device_vector.h>

const int nt = 4;  // number of threads
const int ne = 4;  // number of elements per thread

template <typename T>
__global__ void k(T *d){
    T e0 = threadIdx.x+10;
    T e1 = threadIdx.x+20;
    T e2 = threadIdx.x+30;
    T e3 = threadIdx.x+40;
    // thread-major (uncoalesced) storage: each thread writes its ne elements to
    // consecutive slots, so the per-thread stride is ne (elements per thread)
    d[threadIdx.x*ne]   = e0;
    d[threadIdx.x*ne+1] = e1;
    d[threadIdx.x*ne+2] = e2;
    d[threadIdx.x*ne+3] = e3;
}

int main(){
    thrust::device_vector<int> d(ne*nt);
    k<<<1,nt>>>(thrust::raw_pointer_cast(d.data()));
}
The difference in storage efficiency between these two cases is the difference between coalesced (first example) and uncoalesced (second example) access, which is covered in numerous questions here under the cuda tag on SO, such as this one.
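If the thread-major order is only needed once the data is back on the CPU, one other option (not part of the answer above, just a sketch) is to keep the coalesced kernel from the first example and transpose the layout on the host after the copy:

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

// assumes the coalesced kernel k and the constants nt, ne from the first example
int main(){
    thrust::device_vector<int> d(ne*nt);
    k<<<1,nt>>>(thrust::raw_pointer_cast(d.data()));
    thrust::host_vector<int> interleaved = d;   // device-to-host copy
    thrust::host_vector<int> by_thread(ne*nt);
    // transpose from element-major (t0e0 t1e0 ...) to thread-major (t0e0 t0e1 ...)
    for (int t = 0; t < nt; t++)
        for (int e = 0; e < ne; e++)
            by_thread[t*ne + e] = interleaved[e*nt + t];
}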