
cuda thrust::for_each with thrust::counting_iterator


I'm a bit of a newcomer to CUDA and thrust. I seem to be unable to get the thrust::for_each algorithm to work when supplied with a counting_iterator. Here is my simple functor:

struct print_Functor {
    print_Functor(){}
    __host__ __device__
    void operator()(int i)
    {
        printf("index %d\n", i);
    }
}; 

Now if I call this with a host-vector prefilled with a sequence, it works fine:

    thrust::host_vector<int> h_vec(10);
    thrust::sequence(h_vec.begin(),h_vec.end());
    thrust::for_each(h_vec.begin(),h_vec.end(), print_Functor());

However, if I try to do this with thrust::counting_iterator it fails:

    thrust::counting_iterator<int> first(0);
    thrust::counting_iterator<int> last = first+10;
    for(thrust::counting_iterator<int> it=first;it!=last;it++)
        printf("Value %d\n", *it);
    printf("Launching for_each\n");
    thrust::for_each(first,last,print_Functor());

What I get is that the for loop executes correctly, but the for_each fails with the error message:

   after cudaFuncGetAttributes: unspecified launch failure

I tried to do this by making the iterator type a template argument:

thrust::for_each<thrust::counting_iterator<int>>(first,last, print_Functor());

but the same error results.

For completeness, I'm calling this from a MATLAB mex file (64 bit).

I've been able to get other thrust algorithms to work with the counting iterator (e.g. thrust::reduce gives the right result).

As a newcomer I'm probably doing something really stupid and missing something obvious - can anyone help?

Thanks for the comments so far; I have taken them on board. The worked example (outside MATLAB) ran correctly and produced output, but when built as a mex file it still did not work: the first run produced no output at all, and the second run produced the same error message as before (fixed only by recompiling, after which it goes back to producing no output).

However, there is a similar problem even from a DOS prompt: thrust::for_each does not execute the functor. Here is a complete example:

#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>

struct sum_Functor {
    int *sum;
    sum_Functor(int *s){sum = s;}
    __host__ __device__
    void operator()(int i)
    {
        *sum+=i;
        printf("In functor: i %d sum %d\n",i,*sum);
    }

};

int main(){

    thrust::counting_iterator<int> first(0);
    thrust::counting_iterator<int> last = first+10;
    int sum = 0;
    sum_Functor sf(&sum);
    printf("After constructor: value is %d\n", *(sf.sum));
    for(int i=0;i<5;i++){
        sf(i);
    }

    printf("Initiating for_each call - current value %d\n", (*(sf.sum)));
    thrust::for_each(first,last,sf);

    cudaDeviceSynchronize();
    printf("After for_each: value is %d\n",*(sf.sum));
}

This is compiled under a DOS prompt with:

nvcc -o pf pf.cu

The output produced is:

After constructor: value is 0
In functor: i 0 sum 0
In functor: i 1 sum 1
In functor: i 2 sum 3
In functor: i 3 sum 6
In functor: i 4 sum 10
Initiating for_each call - current value 10
After for_each: value is 10

In other words the functor's overloaded operator() is called correctly from the for loop but is never called by the thrust::for_each algorithm. The only way to get the for_each to execute the functor when using the counting iterator is to omit the member variable.

(I should add that after years of using pure MATLAB, my C++ is very rusty, so I could be missing something obvious ...)


Solution

  • In your comments you say that you want your code to be executed on the host side.

    The error "unspecified launch failure", together with the fact that your functor is declared __host__ __device__, makes me think Thrust is trying to execute it on the device.

    Can you add an execution policy to make sure where your code is executed?

    Replace:

    thrust::for_each(first,last,sf);
    

    with

    thrust::for_each(thrust::host, first,last,sf);
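
    As a minimal sketch of the host-side variant as a whole program (assuming the default sequential host backend; host_sum is a hypothetical helper name used only for illustration, and the policy needs the thrust/execution_policy.h header):

    ```cuda
    #include <cstdio>
    #include <thrust/for_each.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/execution_policy.h>

    // Same shape as the functor in the question: it accumulates through a pointer.
    struct sum_Functor {
        int *sum;  // points to host memory, so it is only valid with thrust::host
        explicit sum_Functor(int *s) : sum(s) {}
        __host__ __device__
        void operator()(int i) const { *sum += i; }
    };

    // Hypothetical helper: sums 0..n-1 by running for_each on the host.
    int host_sum(int n) {
        thrust::counting_iterator<int> first(0);
        thrust::counting_iterator<int> last = first + n;
        int sum = 0;
        // thrust::host forces host execution, so dereferencing &sum is safe here.
        thrust::for_each(thrust::host, first, last, sum_Functor(&sum));
        return sum;
    }

    int main() {
        printf("sum = %d\n", host_sum(10));  // 0+1+...+9 = 45
        return 0;
    }
    ```

    Thrust copies the functor by value, which is why the running total has to live behind the pointer rather than in the functor itself.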
    

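    To run on the GPU instead, the memory the functor writes to must be allocated on the device (with cudaMalloc) and the result copied back to the host afterwards. In your example, sum points to a host stack variable, which device code generally cannot dereference.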


    #include <cstdio>   // printf
    #include <cstdlib>  // atoi
    #include <thrust/for_each.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/execution_policy.h>
    
    struct sum_Functor {
        int *sum;
        sum_Functor(int *s){sum=s;}
        // atomicAdd is a device-only function, so the functor is marked __device__ alone
        __device__
        void operator()(int i)
        {
            atomicAdd(sum, 1);  // counts elements; use atomicAdd(sum, i) to total the indices as in the question
        }
    };
    
    int main(int argc, char**argv){
    
    
        thrust::counting_iterator<int> first(0);
        // the element count is read from the first command-line argument (required)
        thrust::counting_iterator<int> last = first+atoi(argv[1]);
        int *d_sum;
        int h_sum = 0;
    
        cudaMalloc(&d_sum,sizeof(int));
        cudaMemcpy(d_sum,&h_sum,sizeof(int),cudaMemcpyHostToDevice);
    
        thrust::for_each(thrust::device,first,last,sum_Functor(d_sum));
    
        cudaDeviceSynchronize();
        cudaMemcpy(&h_sum,d_sum,sizeof(int),cudaMemcpyDeviceToHost);
        printf("sum = %d\n", h_sum);
        cudaFree(d_sum);
    
    }
    

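    Code update: to get a correct result on the device you must use an atomic operation, because for_each runs the functor from many threads at once and an unsynchronized *sum += i would be a data race.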
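
    Since you mention that thrust::reduce already works for you with the counting iterator, here is a sketch of the same total computed that way, with no functor state, device pointer, or atomics. device_sum is a hypothetical helper name, and this assumes a working CUDA device:

    ```cuda
    #include <cstdio>
    #include <thrust/reduce.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/execution_policy.h>

    // Hypothetical helper: sums 0..n-1 on the GPU and returns the result
    // to the caller as an ordinary host value.
    int device_sum(int n) {
        thrust::counting_iterator<int> first(0);
        thrust::counting_iterator<int> last = first + n;
        // The last argument is the initial value of the reduction.
        return thrust::reduce(thrust::device, first, last, 0);
    }

    int main() {
        printf("sum = %d\n", device_sum(10));  // 0+1+...+9 = 45
        return 0;
    }
    ```

    A reduction avoids serializing every thread on a single atomic counter, so it is usually the better fit when all you need is the total.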