thrust::sort crashes with "invalid argument"

I'm trying to use thrust::sort on device memory, but it crashes at runtime. I also tried disabling debug information generation.

Here is a minimal example:

```
cudaSetDevice(0);
int u[10];
int* v;
cudaMalloc(&v, 10 * sizeof(int));
for (int i = 0; i < 10 ; i++)
    u[i] = 10-i;
cudaMemcpy(u, v, 10 * sizeof(int), cudaMemcpyHostToDevice);
try{
    thrust::sort(thrust::device_ptr<int>(v), thrust::device_ptr<int>(v+10));
}
catch (thrust::system_error &e)
    printf("Error: %s \n",e.what());
cudaMemcpy(v, u, 10 * sizeof(int), cudaMemcpyDeviceToHost);
for (int i = 0; i < 10; i++)
    printf("%d\n", u[i]);
```

e.what() gives the following message:

```
CUDA error 11 [c:\program files\nvidia gpu computing toolkit\cuda\v7.5\include\thrust\system\cuda\detail\cub\device\dispatch/device_radix_sort_dispatch.cuh, 687]: invalid argument
CUDA error 11 [c:\program files\nvidia gpu computing toolkit\cuda\v7.5\include\thrust\system\cuda\detail\cub\device\dispatch/device_radix_sort_dispatch.cuh, 875]: invalid argument
Error: after cub_::DeviceRadixSort::SortKeys(1): invalid argument
```

I'm using a GeForce 940M and Visual Studio 2013, with the CUDA project wizard used to generate the project. The nvcc build line is:

"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\bin\nvcc.exe" -gencode=arch=compute_20,code=\"sm_20,compute_20\" --use-local-env --cl-version 2013 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\include" --keep-dir Debug -maxrregcount=0 --machine 32 --compile -cudart static -g -DWIN32 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd " -o Debug\kernel.cu.obj "C:\Users\ndrei\Documents\Visual Studio\2013\Projects\Thrust_Test\Thrust_Test\kernel.cu"

Please help me!

Solution

First of all, your example is incomplete. What good does it do to cut off the header files, etc.? It doesn't make things easier for others who are trying to help you. I also doubt this is an excerpt of the code you are actually running, because your try/catch is malformed: the catch handler has no braces around its body, which is a compile error.

In the future, please provide a proper MCVE (minimal, complete, verifiable example): complete code that someone can copy, paste, compile and run without having to add or change anything.

Regarding the thrust error:

cudaMemcpy is not part of Thrust. It is part of the CUDA runtime API, and you are advised to use proper CUDA error checking any time you are having trouble with code that uses the CUDA runtime API. Had you done so, your attention would have been drawn immediately to a single line of code, rather than leaving you with no clue where the error was.

This is wrong:
```
cudaMemcpy(u, v, 10 * sizeof(int), cudaMemcpyHostToDevice);
```
cudaMemcpy, like memcpy, takes the pointer to the destination of the copy as its first parameter, followed by the pointer to the source.

Since u is a host pointer and v is a device pointer, this call names the host array as the destination and the device allocation as the source. That is inconsistent with your intent, as well as with the stated direction of the copy (cudaMemcpyHostToDevice).

You've made the same mistake on the subsequent cudaMemcpy call: it is marked cudaMemcpyDeviceToHost, but it passes v (device) as the destination and u (host) as the source.

What follows is a better example of an MCVE. It is a modified version of the code you have shown, with the errors fixed:

```
#include <stdio.h>
#include <stdlib.h>   // exit
#include <thrust/sort.h>
#include <thrust/device_ptr.h>
#include <thrust/system_error.h>

// Report any pending CUDA runtime error and abort.
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

int main(){

  cudaSetDevice(0);
  int u[10];
  int* v;
  cudaMalloc(&v, 10 * sizeof(int));
  cudaCheckErrors("cudaMalloc fail");
  for (int i = 0; i < 10 ; i++)
    u[i] = 10-i;
  // Destination (device) first, then source (host).
  cudaMemcpy(v, u, 10 * sizeof(int), cudaMemcpyHostToDevice);
  cudaCheckErrors("cudaMemcpy 1 fail");
  try {
    thrust::sort(thrust::device_ptr<int>(v), thrust::device_ptr<int>(v+10));
  }
  catch (thrust::system_error &e) {
    printf("Error: %s \n", e.what());
  }
  // Destination (host) first, then source (device).
  cudaMemcpy(u, v, 10 * sizeof(int), cudaMemcpyDeviceToHost);
  cudaCheckErrors("cudaMemcpy 2 fail");
  for (int i = 0; i < 10; i++)
    printf("%d\n", u[i]);
  cudaFree(v);
  return 0;
}
```
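
As an alternative, and not something the question's code does, the raw cudaMalloc/cudaMemcpy calls can be avoided by letting Thrust own the device memory. The sketch below assumes the same ten-element input. The function name `sort_on_device` is invented for this example; `thrust::device_vector` and `thrust::copy` are standard Thrust facilities that infer the copy direction from the iterator types, so there is no direction flag to get wrong.

```cuda
#include <cstdio>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/system_error.h>

// Hypothetical helper: sorts n ints in place by round-tripping through the GPU.
// Returns false if Thrust reports a system error.
bool sort_on_device(int* data, int n) {
    try {
        // Constructing from a host range allocates device memory and copies host -> device.
        thrust::device_vector<int> d(data, data + n);
        thrust::sort(d.begin(), d.end());
        // Copies device -> host; the direction follows from the iterator types.
        thrust::copy(d.begin(), d.end(), data);
    } catch (thrust::system_error& e) {
        printf("Error: %s\n", e.what());
        return false;
    }
    return true;
}

int main() {
    int u[10];
    for (int i = 0; i < 10; i++)
        u[i] = 10 - i;
    if (!sort_on_device(u, 10))
        return 1;
    for (int i = 0; i < 10; i++)
        printf("%d\n", u[i]);
    return 0;
}
```

The trade-off is that the device allocation is created and released inside the helper, which is convenient for a small test but may not suit code that needs to keep data resident on the GPU between calls.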

Notes:

- The Thrust system error mechanism will catch pre-existing CUDA errors and throw those, as well as any that are associated with your actual Thrust code. It is therefore advisable to do CUDA error checking on CUDA code and Thrust error checking on Thrust code, to make your debugging less confusing.
- As an unrelated suggestion, your project is set up to build for:
  - a compute capability 2.0 (cc2.0) device
  - a 32-bit application

  Neither of these is a recommended setting. I would recommend modifying your project to build a Release, x64 application, and changing your build target from cc2.0 to match the compute capability of your GPU. In your project settings, this probably means changing a Visual Studio project setting under CUDA Device from compute_20,sm_20 to compute_50,sm_50 to match your GeForce 940M GPU.
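
If you are unsure which compute capability your GPU has, you can ask the runtime. This is a minimal sketch using `cudaGetDeviceProperties` on device 0, the same device selected by cudaSetDevice(0) in the question. The wrapper name `query_cc` is made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: stores the compute capability of `device` in *major / *minor.
// Returns the CUDA status so the caller can check it.
cudaError_t query_cc(int device, int* major, int* minor) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err == cudaSuccess) {
        *major = prop.major;
        *minor = prop.minor;
    }
    return err;
}

int main() {
    int major = 0, minor = 0;
    cudaError_t err = query_cc(0, &major, &minor);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // A device that reports 5.0 corresponds to the compute_50,sm_50 setting.
    printf("compute capability %d.%d\n", major, minor);
    return 0;
}
```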