Can thrust::gather be used "in-place"?

Consider the following code:

#include <time.h>       // --- time
#include <stdlib.h>     // --- srand, rand
#include <fstream>

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/gather.h>      // --- thrust::gather
#include <thrust/sort.h>
#include <thrust/iterator/zip_iterator.h>

#include "TimingGPU.cuh"

/********/
/* MAIN */
/********/
int main() {

    const int N = 16384;

    std::ifstream h_indices_File, h_x_File;
    h_indices_File.open("h_indices.txt");
    h_x_File.open("h_x.txt");

    std::ofstream h_x_result_File;
    h_x_result_File.open("h_x_result.txt");

    thrust::host_vector<int> h_indices(N);
    thrust::host_vector<double> h_x(N);
    thrust::host_vector<double> h_sorted(N);

    for (int k = 0; k < N; k++) {
        h_indices_File >> h_indices[k];
        h_x_File >> h_x[k];
    }

    thrust::device_vector<int> d_indices(h_indices);
    thrust::device_vector<double> d_x(h_x);

    thrust::gather(d_indices.begin(), d_indices.end(), d_x.begin(), d_x.begin());
    h_x = d_x;
    for (int k = 0; k < N; k++) h_x_result_File << h_x[k] << "\n";

    //thrust::device_vector<double> d_x_sorted(N);
    //thrust::gather(d_indices.begin(), d_indices.end(), d_x.begin(), d_x_sorted.begin());
    //h_x = d_x_sorted;
    //for (int k = 0; k < N; k++) h_x_result_File << h_x[k] << "\n";

}

The code loads an array of indices from h_indices.txt and an array of doubles from h_x.txt. It then transfers those arrays to the GPU as d_indices and d_x and uses thrust::gather to obtain the equivalent of Matlab's

d_x(d_indices)

The two txt files can be downloaded from h_indices.txt and h_x.txt. The code creates an output result file h_x_result.txt.

If I use the "in-place" version of thrust::gather (the last three uncommented lines of the code), the result differs from d_x(d_indices). If I instead use the not "in-place" version (the commented-out lines at the end of the code), the result is correct.

In Matlab, I'm using

load h_indices.txt; load h_x.txt; load h_x_result.txt
plot(h_x(h_indices + 1)); hold on; plot(h_x_result, 'r'); hold off

The "in-place" case returns the following comparison

On the other hand, the not "in-place" case returns

I'm using Windows 10, CUDA 8.0 and Visual Studio 2013, compiling in Release mode and running on an NVIDIA GTX 960 (compute capability 5.2).

Solution

thrust::gather can't be used in place: its documentation lists as a precondition that the input data must not overlap the output range.

But I would go as far as to suggest that no "naïve" gather operation can be safely performed in-place, and that the Matlab snippet you presented as in-place (presumably d_x = d_x(d_indices)) isn't an in-place operation at all.
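To make the hazard concrete, here is a minimal sketch in plain C++ (deliberately not Thrust) that contrasts an out-of-place gather with a naive in-place one. The function names gather_out_of_place and gather_in_place_naive are hypothetical, chosen for illustration only. The in-place loop runs sequentially, which is just one possible ordering; on a GPU the order of reads and writes across threads is unspecified, so the wrong values you see may differ, but the cause is the same: some elements are read after they have already been overwritten.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Out-of-place gather: result[i] = src[map[i]].
// Every read comes from the untouched source, so the result does not
// depend on the order in which the elements are processed.
std::vector<double> gather_out_of_place(const std::vector<int>& map,
                                        const std::vector<double>& src) {
    std::vector<double> result(map.size());
    for (std::size_t i = 0; i < map.size(); ++i) {
        result[i] = src[map[i]];
    }
    return result;
}

// Naive "in-place" gather (illustration only): x[i] = x[map[i]].
// An element that is read after it has been overwritten yields a stale
// value, so the result depends on the processing order.
void gather_in_place_naive(const std::vector<int>& map,
                           std::vector<double>& x) {
    assert(map.size() == x.size());
    for (std::size_t i = 0; i < map.size(); ++i) {
        x[i] = x[map[i]];
    }
}
```

With map = {1, 2, 0} and x = {10, 20, 30}, the out-of-place version gives {20, 30, 10}. The sequential in-place loop gives {20, 30, 20}, because x[0] has already been overwritten with 20 by the time the last element reads it. This is also why the working variant in your question, which gathers into a separate d_x_sorted, is the safe one: the extra buffer is what keeps the reads and writes apart.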