Tags: vector, cuda, thrust, gradient-descent

Perform a sum of vectors in CUDA/Thrust


So I'm trying to implement stochastic gradient descent in CUDA, and my idea is to parallelize it similarly to the approach described in the paper Optimal Distributed Online Prediction Using Mini-Batches.

That implementation is aimed at MapReduce distributed environments, so I'm not sure whether it's optimal when using GPUs.

In short, the idea is: at each iteration, calculate the error gradient for each data point in the batch (map), average them by summing/reducing the gradients, and finally perform the gradient step, updating the weights according to the average gradient. The next iteration starts with the updated weights.
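To make the map/reduce structure concrete, here is a hedged, plain-CPU sketch of one such mini-batch iteration. The model, loss, and names (compute_gradient, minibatch_step) are hypothetical placeholders for illustration, not taken from the paper or from any library:

```cpp
// A CPU-side sketch of one mini-batch iteration: "map" per-point gradients,
// "reduce" them to an average, then take a gradient step.
// compute_gradient() is a hypothetical placeholder (gradient of a squared-error
// loss for a linear model), not the actual model in question.
#include <cstddef>
#include <iostream>
#include <vector>

// Gradient of 0.5 * (w.x - y)^2 with respect to w, for one data point (x, y).
std::vector<float> compute_gradient(const std::vector<float>& w,
                                    const std::vector<float>& x, float y)
{
    float pred = 0.0f;
    for (std::size_t j = 0; j < w.size(); ++j) pred += w[j] * x[j];
    std::vector<float> g(w.size());
    for (std::size_t j = 0; j < w.size(); ++j) g[j] = (pred - y) * x[j];
    return g;
}

// One mini-batch step over the batch (xs, ys) with learning rate lr.
void minibatch_step(std::vector<float>& w,
                    const std::vector<std::vector<float> >& xs,
                    const std::vector<float>& ys, float lr)
{
    std::vector<float> avg(w.size(), 0.0f);
    for (std::size_t i = 0; i < xs.size(); ++i)             // map: one gradient per point
    {
        std::vector<float> g = compute_gradient(w, xs[i], ys[i]);
        for (std::size_t j = 0; j < w.size(); ++j) avg[j] += g[j];
    }
    for (std::size_t j = 0; j < w.size(); ++j)              // reduce: average the gradients
        avg[j] /= static_cast<float>(xs.size());
    for (std::size_t j = 0; j < w.size(); ++j)              // update: gradient step
        w[j] -= lr * avg[j];
}

int main()
{
    std::vector<float> w(2, 0.0f);
    std::vector<std::vector<float> > xs(4, std::vector<float>(2, 1.0f));
    std::vector<float> ys(4, 3.0f);
    minibatch_step(w, xs, ys, 0.1f);
    std::cout << "w[0] = " << w[0] << ", w[1] = " << w[1] << std::endl;
    return 0;
}
```

The "map" and "reduce" loops are exactly the parts I would like to hand off to the GPU.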

The Thrust library lets me perform a reduction on a vector, for example to sum all of its elements.
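For instance, a whole-vector sum with thrust::reduce looks roughly like this (a minimal sketch, assuming a device_vector of floats):

```cpp
// Summing all elements of a single device vector with thrust::reduce.
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <iostream>

int main()
{
    thrust::device_vector<float> v(10, 2.0f);         // ten elements, all 2.0

    // initial value 0.0f; the default binary operator is plus
    float total = thrust::reduce(v.begin(), v.end(), 0.0f);

    std::cout << "sum = " << total << std::endl;       // prints sum = 20
    return 0;
}
```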

My question is: how can I sum/reduce an array of vectors in CUDA/Thrust? The input would be an array of vectors, and the output would be a single vector that is the sum of all the vectors in the array (or, ideally, their average).


Solution

  • Converting my comment into this answer:

    Let's say each vector has length m and the array has size n. An "array of vectors" is then the same as a matrix of size n x m.

    If you change your storage format from this "array of vectors" to a single vector of size n * m, you can use thrust::reduce_by_key to sum each row of this matrix separately.

    The Thrust sum_rows example shows how to do this; a minimal sketch of the same idea follows below.
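For reference, here is a hedged sketch of that row-wise reduce_by_key approach (it is not the sum_rows example itself). The layout is an assumption: the array of n vectors of length m is flattened so that row j holds component j of every vector (m rows of n contiguous elements), so summing each row yields the element-wise sum the question asks for:

```cpp
// Row-wise summation with thrust::reduce_by_key over a flattened matrix.
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <iostream>

// Maps a flat element index to its row index (the key for reduce_by_key).
struct row_index : public thrust::unary_function<int, int>
{
    int cols;  // elements per row
    __host__ __device__ row_index(int cols) : cols(cols) {}
    __host__ __device__ int operator()(int i) const { return i / cols; }
};

int main()
{
    const int n = 4;  // number of vectors in the array
    const int m = 3;  // length of each vector

    // m rows of n elements each; every element is 1.0 for demonstration
    thrust::device_vector<float> data(m * n, 1.0f);

    // Keys 0,...,0, 1,...,1, ..., m-1,...,m-1 generated on the fly.
    typedef thrust::transform_iterator<row_index,
                                       thrust::counting_iterator<int> > key_iterator;
    key_iterator keys = thrust::make_transform_iterator(
        thrust::counting_iterator<int>(0), row_index(n));

    thrust::device_vector<float> elementwise_sum(m);

    // Consecutive equal keys are reduced together -> one sum per row.
    thrust::reduce_by_key(keys, keys + m * n,
                          data.begin(),
                          thrust::make_discard_iterator(),  // keys output unused
                          elementwise_sum.begin());

    for (int j = 0; j < m; ++j)
        std::cout << "sum[" << j << "] = " << elementwise_sum[j] << std::endl;
    return 0;
}
```

Dividing each entry of the result by n afterwards (e.g. with a thrust::transform) then gives the average gradient rather than the sum.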