I have the following code as part of a reorganization of data for later use in a CUDA kernel:
thrust::device_ptr<int> dev_ptr = thrust::device_pointer_cast(dev_particle_cell_indices);
int total = 0;
for(int i = 0; i < num_cells; i++) {
    particle_offsets[i] = total;
    // int num = 0;
    int num = thrust::count(dev_ptr, dev_ptr + num_particles, i);
    particle_counts[i] = num;
    total += num;
}
Now, if I set num to 0 (uncomment the 5th line, and comment out the 6th), the application runs at over 30 fps, which is my goal. However, when I set num equal to the thrust::count call, the framerate drops to about 1-2 fps. Why would that happen?
My understanding is that thrust is supposed to be a collection of highly optimized algorithms which harness the power of the GPU, so I'm surprised that it would have that kind of impact on the performance of my program. This is my first time using thrust though, so I may be unaware of some important details.
Is there something about using thrust::count in a loop which is causing it to run so slowly? How can I optimize my usage of it?
To give some figures, in my current test case, num_particles is about 2000, and num_cells is about 1500.
The performance of thrust::count is just fine; it is the way you are trying to use it that is problematic for performance. If you had a lot of particles and only a few cells, then your implementation using thrust::count would probably not be a bad idea. Your problem is that you have 1500 cells. That means 1500 invocations of count and 1500 device-to-host memory transfers each time you want to do the computation. The latency of all the kernel launches and all the PCI-e bus copies will kill performance, as you have found.
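To make the cost concrete: each thrust::count call launches at least one reduction kernel and then blocks while the single int result is copied back to the host. A minimal sketch of timing one such call (the vector size and contents here are placeholders for illustration, not your data):

#include <thrust/count.h>
#include <thrust/device_vector.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Assumed stand-in for dev_particle_cell_indices: 2000 zeroed entries
    thrust::device_vector<int> cells(2000, 0);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // One kernel launch plus one device-to-host copy of a single int
    int n = (int)thrust::count(cells.begin(), cells.end(), 0);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("one thrust::count call: %d matches in %.3f ms\n", n, ms);
    return 0;
}

Multiply whatever per-call time this reports by 1500 cells per frame and the 1-2 fps you are seeing is no surprise.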
A better approach for a large number of cells would be something like this:
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>
#include <thrust/iterator/counting_iterator.h>

thrust::device_ptr<int> rawin = thrust::device_pointer_cast(dev_particle_cell_indices);

// Sort a scratch copy of the cell indices by value
thrust::device_vector<int> cidx(num_particles);
thrust::copy(rawin, rawin + num_particles, cidx.begin());
thrust::sort(cidx.begin(), cidx.end());

// Use binary search to extract all the cell counts/offsets
thrust::counting_iterator<int> cellnumber(0);
thrust::device_vector<int> offsets(num_cells), counts(num_cells);

// Offsets come from the lower_bound of each cell number in the sorted indices
thrust::lower_bound(cidx.begin(), cidx.end(), cellnumber, cellnumber + num_cells, offsets.begin());

// Counts come from the adjacent_difference of the upper_bound of each cell number
thrust::upper_bound(cidx.begin(), cidx.end(), cellnumber, cellnumber + num_cells, counts.begin());
thrust::adjacent_difference(counts.begin(), counts.end(), counts.begin());

// Copy the results back to the host arrays
thrust::copy(counts.begin(), counts.end(), particle_counts);
thrust::copy(offsets.begin(), offsets.end(), particle_offsets);
Here, we first sort a local copy of the cell indices, then use the thrust binary search functions to perform the same operation as your code, but with far fewer passes through the data in GPU memory and only two device-to-host memory copies to get all the results back to the host.
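To see what the binary searches produce, here is a tiny host-side example with made-up data (six particles in four cells); the numbers are purely illustrative:

#include <thrust/host_vector.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>
#include <thrust/iterator/counting_iterator.h>
#include <cstdio>

int main() {
    // Sorted cell index per particle: cell 0 holds 2 particles,
    // cell 1 holds 0, cell 2 holds 3, cell 3 holds 1
    int sorted[] = {0, 0, 2, 2, 2, 3};
    thrust::host_vector<int> cidx(sorted, sorted + 6);

    const int num_cells = 4;
    thrust::counting_iterator<int> cellnumber(0);
    thrust::host_vector<int> offsets(num_cells), counts(num_cells);

    thrust::lower_bound(cidx.begin(), cidx.end(), cellnumber, cellnumber + num_cells, offsets.begin());
    thrust::upper_bound(cidx.begin(), cidx.end(), cellnumber, cellnumber + num_cells, counts.begin());
    thrust::adjacent_difference(counts.begin(), counts.end(), counts.begin());

    // Prints offsets {0, 2, 2, 5} and counts {2, 0, 3, 1}
    for (int i = 0; i < num_cells; i++)
        printf("cell %d: offset %d, count %d\n", i, (int)offsets[i], (int)counts[i]);
    return 0;
}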
When I benchmark your thrust::count implementation against the code I posted above for a non-trivial case (10000 random particles and 2000 cells on a GeForce 320M with CUDA 4.1 on OS X), your version takes about 0.95 seconds to run, whereas the sort/search version takes about 0.003 seconds. So there is probably a speedup of several hundred times available to you using thrust if you choose a more efficient strategy and more appropriate algorithms.
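If you want to reproduce that kind of comparison, a harness along these lines is one way to do it (the random data and event-based timing are illustrative choices, not the exact benchmark I ran):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/count.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/copy.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const int num_particles = 10000, num_cells = 2000;

    // Assign each particle a random cell index
    thrust::host_vector<int> h_cells(num_particles);
    for (int i = 0; i < num_particles; i++)
        h_cells[i] = rand() % num_cells;
    thrust::device_vector<int> d_cells = h_cells;

    thrust::host_vector<int> offsets(num_cells), counts(num_cells);
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    float ms_count = 0.0f, ms_sort = 0.0f;

    // Version 1: one thrust::count call per cell
    cudaEventRecord(t0);
    int total = 0;
    for (int i = 0; i < num_cells; i++) {
        offsets[i] = total;
        int num = (int)thrust::count(d_cells.begin(), d_cells.end(), i);
        counts[i] = num;
        total += num;
    }
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms_count, t0, t1);

    // Version 2: sort plus binary search
    cudaEventRecord(t0);
    thrust::device_vector<int> cidx = d_cells;
    thrust::sort(cidx.begin(), cidx.end());
    thrust::counting_iterator<int> cellnumber(0);
    thrust::device_vector<int> d_off(num_cells), d_cnt(num_cells);
    thrust::lower_bound(cidx.begin(), cidx.end(), cellnumber, cellnumber + num_cells, d_off.begin());
    thrust::upper_bound(cidx.begin(), cidx.end(), cellnumber, cellnumber + num_cells, d_cnt.begin());
    thrust::adjacent_difference(d_cnt.begin(), d_cnt.end(), d_cnt.begin());
    thrust::copy(d_cnt.begin(), d_cnt.end(), counts.begin());
    thrust::copy(d_off.begin(), d_off.end(), offsets.begin());
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms_sort, t0, t1);

    printf("count loop: %.2f ms, sort/search: %.2f ms\n", ms_count, ms_sort);
    return 0;
}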