Why intersection of Thrust Library is returning unexpected result?

I'm using the library Thrust to get intersection set of a two larger sets of integers. In test with 2 small inputs i got correct results, but when i use two sets with 10^8 and 65535*1024 elements i got a empty set. Who's can explain this problem? Changing the two first variables to smaller values the thrust returns a expected intersection set. My code is following.

#include <thrust/set_operations.h>
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <iostream>
#include <stdio.h>


int main() {
    int sizeArrayLonger = 100*1000*1000;
    int sizeArraySmaller = 65535*1024;
    int length_result = sizeArraySmaller;    
    int* list = (int*) malloc(4*sizeArrayLonger);
    int* list_smaller = (int*) malloc(4*sizeArraySmaller);
    int* result = (int*) malloc(4*length_result);

    int* list_gpu;
    int* list_smaller_gpu;
    int* result_gpu;

    // THE NEXT TWO FORS TRANSFORMS THE SMALLER ARRAY IN A SUBSET OF THE LARGER ARRAY
    for (int i=0; i < sizeArraySmaller; i++) {
        list_smaller[i] = i+1;
        list[i] = i+1;
    }
    for (int i=sizeArraySmaller; i < sizeArrayLonger; i++) {
        list[i] = i+1;
    }

    cudaMalloc(&list_gpu, sizeof(int) * sizeArrayLonger);
    cudaMalloc(&list_smaller_gpu, sizeof(int) * sizeArraySmaller);
    cudaMalloc(&result_gpu, sizeof(int) * length_result);

    cudaMemcpy(list_gpu, list, sizeof(int) * sizeArrayLonger, cudaMemcpyHostToDevice);
    cudaMemcpy(list_smaller_gpu, list_smaller, sizeof(int) * sizeArraySmaller, cudaMemcpyHostToDevice);
    cudaMemset(result_gpu, 0, sizeof(int) * length_result);

    typedef thrust::device_ptr<int> device_ptr;

    thrust::set_intersection(device_ptr(list_gpu), device_ptr(list_gpu + sizeArrayLonger), device_ptr(list_smaller_gpu),
        device_ptr(list_smaller_gpu + sizeArraySmaller), device_ptr(result_gpu), thrust::less<int>() );

    // MOVING TO CPU THE MARKER ARRAY OF ELEMENTS OF INTERSECTION SET
    cudaMemcpy(result, result_gpu, sizeof(int)*length_result, cudaMemcpyDeviceToHost);

    cudaDeviceSynchronize();

    // THIS LOOP ITERATES ALL ARRAY NAMED "result" WHERE THE POSITION ARE MARKED WITH 1
    int counter = 0;
    for (int i=0; i < length_result; i++)
        if (result[i]) {
            printf("\n-> %d", result[i]);
            counter++;
        }

    printf("\nTHRUST -> Total of elements: %d\n", counter);

    cudaDeviceReset();

    return 0;
}

Solution

It seems OP hasn't visited recently so I'll expand on my comments for other readers. (I was hoping to get some confirmation that specifying the compute target of the device in use during compilation would fix OP's observation as well.)

According to my testing, OP's code will:

Pass if compiled for a cc2.0 device and run on a cc2.0 device.
Pass if compiled for a cc3.0 device and run on a cc3.x device.
FAIL if compiled for a cc2.0 device and run on a cc3.x device.

This last result is a bit non-intuitive. Normally we like to think of CUDA code compiled with PTX (e.g. nvcc -arch=sm_20 ... or similar) as being forward-compatible with future architectures, due to the runtime JIT mechanism.

However, there is a trap (as well as a somewhat related issue in thrust.) It's not uncommon for CUDA codes to query the device they are actually running on (e.g. via cudaGetDeviceProperties) and make decisions (such as kernel configuration decisions) based on the device in use. Specifically, in this case, thrust is launching a kernel under the hood, and making a decision about the size of the grid x dimension to choose for this kernel based on the actual device in use. CC 2.x devices are limited to 65535 for this parameter but CC 3.x and higher devices have a much higher limit. So, in this case, for a large enough data set, if thrust detects that it is running on a cc3.0 device, it will configure this particular kernel with a grid x dimension larger than 65535. (For a small enough data set, it would not do this, and so this possible error would not surface. Thus, the issue is loosely connected to problem size.)

If we had both cc 2.x and cc 3.x PTX (or appropriate SASS) embedded in the binary, then there still would not be an issue. However, if we have only cc2.x PTX embedded in the binary, then the JIT process will use this to create machine code suitable to run on a cc 3.x device, if that is the device being used. But this forward-JIT-compiled SASS is still subject to CC 2.x limitations, including the grid X dimension limit of 65535. However cudaGetDeviceProperties returns the fact that the device is a cc3.x device, and thus this information will be misleading if it is used for this particular decision (acceptable grid X dimensions).

As a result of this sequence, the kernel is incorrectly configured, and the kernel launch fails with a particular kind of non-sticky CUDA runtime API error. This type of non-sticky error does not corrupt the CUDA context, and so further CUDA operations are still permissible, and future CUDA API calls will not return this error. In order to trap this kind of error after a CUDA kernel launch, it's necessary to issue a cudaGetLastError() or cudaPeekAtLastError() call after the kernel launch, as suggested for proper cuda error checking. A failure to do this means the error is "lost" and not discoverable from future CUDA API calls (excepting cudaGetLastError() or cudaPeekAtLastError()) as they will not indicate the presence of this error or failed kernel launch in the status return value.

Most of the above is discoverable with careful usage of the cuda profiling tools, e.g. nvprof, in the passing and failing cases, as well as cuda-memcheck. In the passing cases, cuda-memcheck reports no errors, and the profiler reveals 8 calls to cudaLaunch as well as 8 kernels actually executed on the GPU. In the failing cases, cuda-memcheck reports 2 kernel launch failures of the type described above, and the profiler shows 8 invocations of cudaLaunch but only 6 kernels actually executed on the GPU. The kernels that are failing are configured with a grid X dimension of 65535 when run on a cc2.x GPU, and are configured with a larger number when run on a cc3.x GPU.

So with proper cuda error checking, the above sequence, while not necessarily desirable, would at least fail with an explicit error. But OP's code fails silently - it returns an incorrect result in the failing case, but thrust doesn't throw any kind of error.

It turns out, under the hood, that thrust error checking on kernels launched from set operations (at least this one, in particular) have this particular error checking gap.

By studying the profiler output carefully, we can discover which files contain the code that thrust is using for launch closure in this case (i.e. where the kernel launch is actually coming from). (You could also figure this out by carefully tracing the templating sequence.) In the particular failing case, I believe the kernel launch is arising from here. If we look at one of the kernel launches there, we see something like this:

#ifndef __CUDA_ARCH__ 
  kernel<<<(unsigned int) num_blocks, (unsigned int) block_size, (unsigned int) smem_size, stream(thrust::detail::derived_cast(exec))>>>(f); 
#else 
  ...
#endif // __CUDA_ARCH__ 
  synchronize_if_enabled("launch_closure_by_value");

synchronize_if_enabled (in this particular code path) would be called immediately after the kernel launch. That function can be found here:

inline __host__ __device__ 
void synchronize_if_enabled(const char *message) 
{ 
// XXX this could potentially be a runtime decision 
//     note we always have to synchronize in __device__ code 
#if __THRUST_SYNCHRONOUS || defined(__CUDA_ARCH__) 
  synchronize(message); 
#else 
  // WAR "unused parameter" warning 
  (void) message; 
#endif

which calls synchronize():

inline __host__ __device__ 
void synchronize(const char *message) 
{ 
  throw_on_error(cudaDeviceSynchronize(), message); 
} // end synchronize()

and we see in synchronize() that throw_on_error calls cudaDeviceSynchronize() which eliminates the previous non-sticky error 11 signifying a misconfigured kernel launch attempt, and in fact returns cudaSuccess (because the cudaDeviceSynchronize() operation itself was, in fact, successful.)

So the summary is that there are 2 issues present:

Thrust (in this case) makes a runtime decision about kernel launch configuration which will be incorrect if the execution device is cc3.0 or higher and the code is compiled for cc2.x (only).
Thrust error checking on this particular set_intersection call is deficient in that it does not have the proper mechanism to catch the non-sticky CUDA runtime API error (error 11) assocated with a misconfigured kernel launch.

The recommendation, then, is to always compile your thrust code specifying a cc3.0 or higher target (at least), if you intend to run on a cc3.0 or higher device. (You can, of course, specify both a cc2.x and a cc3.x target with appropriate choice of nvcc command line switches.) Thrust uses a variety of launch mechanisms under the hood, and not all (perhaps most) are not subject to this particular deficiency (#2), but it appears (to me) that this particular set_intersection call is subject to this deficiency, at this time (thrust v1.8).

It's not clear (to me) that there is a way to otherwise systemically address the first issue above (#1). I have brought the second issue above (#2) to the attention of the thrust developers (via a RFE or bug report.)

As a workaround, a thrust developer can insert a call to cudaGetLastError() in their thrust application (perhaps at the end) to keep this type of error from being "silent".