cuda, thrust

Array Length in CUDA thrust


My CUDA kernel uses thrust's sort and reduce_by_key. When I use an array of more than 460 elements, it starts showing incorrect results.

Could anyone explain this behavior? Or is it something related to my machine?

The sort works correctly regardless of the size; however, reduce_by_key does not, and returns improper results.

More details about the code: I have four arrays. 1) The input keys, defined as wholeSequenceArray. 2) The input values, defined in the kernel with an initial value of 1. 3) The output keys, to save the distinct values of the input keys. 4) The output values, to save the sum of the input values corresponding to the same input key.

For more details about reduce_by_key, please visit this page: https://thrust.github.io/doc/group__reductions.html#gad5623f203f9b3fdcab72481c3913f0e0

Here is my code:

#include <cstdlib>
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <vector>
#include <fstream>
#include <string>
#include <cuda.h>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

using namespace std;
#define size 461

__global__ void calculateOccurances(unsigned int *input_keys,
            unsigned int *output_Values) {
    int tid = threadIdx.x;

    const int N = size;
    __shared__ unsigned int input_values[N];

    unsigned int outputKeys[N];

    int i = tid;
    while (i < N) {
            if (tid < N) {
                    input_values[tid] = 1;
            }
            i += blockDim.x;
    }
    __syncthreads();

    thrust::sort(thrust::device, input_keys, input_keys + N);

    thrust::reduce_by_key(thrust::device, input_keys, input_keys + N,
                    input_values, outputKeys, output_Values);

    if (tid == 0) {
            for (int i = 0; i < N; ++i) {
                    printf("%d,", output_Values[i]);
            }
    }

}

int main(int argc, char** argv) {

    unsigned int wholeSequenceArray[size] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                    11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9,
                    10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8,
                    9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7,
                    8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6,
                    7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5,
                    6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4,
                    5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3,
                    4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2,
                    3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1,
                    2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
                    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
                    20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
                    19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
                    18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
                    17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
                    16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
                    15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
                    14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                    13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
                    12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                    11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9,
                    10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8,
                    9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7,
                    8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,1 };

    cout << "wholeSequenceArray:" << endl;
    for (int i = 0; i < size; i++) {
            cout << wholeSequenceArray[i] << ",";
    }

    cout << "\nStart C++ Array New" << endl;
    cout << "Size of Input:" << size << endl;

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Max threads per block:  %d\n", prop.maxThreadsPerBlock);

    unsigned int counts[size];
    unsigned int *d_whole;
    unsigned int *d_counts;

    cudaMalloc((void**) &d_whole, size * sizeof(unsigned int));
    cudaMalloc((void**) &d_counts, size * sizeof(unsigned int));

    cudaMemcpy(d_whole, wholeSequenceArray, size * sizeof(unsigned int),
                    cudaMemcpyHostToDevice);

    calculateOccurances<<<1, size>>>(d_whole, d_counts);

    cudaMemcpy(counts, d_counts, size * sizeof(unsigned int),
                    cudaMemcpyDeviceToHost);

    cout << endl << "Counts" << endl << endl;
    for (int i = 0; i < size; ++i) {
            cout << counts[i] << ",";
    }
    cout << endl;

    cudaFree(d_whole);
}

Solution

  • When you call a thrust algorithm in a kernel, that thrust algorithm is dispatched in its entirety from each CUDA thread. Therefore your code is performing 461 sort operations on the same data (one from each thread of the kernel) in the same place. This means the threads step on each other as they move the data around during the sorting operation.

    If you just want to count occurrences of numbers (effectively histogramming) using the method you have outlined in your question, and you want to use thrust, you don't need to write a CUDA kernel at all.

    If you really want to do this (correctly) from within a CUDA kernel, then it will be necessary for you to restrict the thrust operations (sort and reduce_by_key) to act from only a single thread (and even then, this methodology will be restricted to a single block).

    I don't really think the second (CUDA kernel) approach makes much sense, but for completeness I have modified your code to include a correct example of each method. Note that once you perform the reduction, there is no longer any point in printing out all 461 entries in each array, so I've restricted the printouts to the first 25 entries in each array for clarity:

    $ cat t91.cu
    #include <cstdlib>
    #include <stdlib.h>
    #include <stdio.h>
    #include <iostream>
    #include <vector>
    #include <fstream>
    #include <string>
    #include <cuda.h>
    #include <cuda_runtime.h>
    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>
    #include <thrust/execution_policy.h>
    #include <thrust/iterator/constant_iterator.h>
    
    using namespace std;
    #define size 461
    
    __global__ void calculateOccurances(unsigned int *input_keys,
                unsigned int *output_Values) {
        int tid = threadIdx.x;
    
        const int N = size;
        __shared__ unsigned int input_values[N];
    
        unsigned int outputKeys[N];
    
        int i = tid;
        while (i < N) {
                if (tid < N) {
                        input_values[tid] = 1;
                }
                i += blockDim.x;
        }
        __syncthreads();
        if (tid == 0){
          thrust::sort(thrust::device, input_keys, input_keys + N);
    
          thrust::reduce_by_key(thrust::device, input_keys, input_keys + N,
                        input_values, outputKeys, output_Values);
          }
    
        if (tid == 0) {
        printf("from kernel:\n");
                for (int i = 0; i < 25; ++i) {
                        printf("%d,", output_Values[i]);
                }
        }
    
    }
    
    int main(int argc, char** argv) {
    
        unsigned int wholeSequenceArray[size] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                        11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9,
                        10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8,
                        9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7,
                        8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6,
                        7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5,
                        6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4,
                        5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3,
                        4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2,
                        3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1,
                        2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
                        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
                        20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
                        19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
                        18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
                        17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
                        16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
                        15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
                        14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                        13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
                        12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                        11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9,
                        10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8,
                        9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7,
                        8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,1 };
    
        cout << "wholeSequenceArray:" << endl;
        for (int i = 0; i < size; i++) {
                cout << wholeSequenceArray[i] << ",";
        }
    
        cout << "\nStart C++ Array New" << endl;
        cout << "Size of Input:" << size << endl;
    
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("Max threads per block:  %d\n", prop.maxThreadsPerBlock);
    
    //just using thrust
    
        thrust::device_vector<int> d_seq(wholeSequenceArray, wholeSequenceArray+size);
        thrust::device_vector<int> d_val_out(size);
        thrust::device_vector<int> d_key_out(size);
    
        thrust::sort(d_seq.begin(), d_seq.end());
        int rsize = thrust::get<0>(thrust::reduce_by_key(d_seq.begin(), d_seq.end(), thrust::constant_iterator<int>(1), d_key_out.begin(), d_val_out.begin())) - d_key_out.begin();
        std::cout << "rsize:" << rsize <<  std::endl;
        std::cout << "Thrust keys:" << std::endl;
        thrust::copy_n(d_key_out.begin(), rsize, std::ostream_iterator<int>(std::cout, ","));
        std::cout << std::endl << "Thrust vals:" << std::endl;
        thrust::copy_n(d_val_out.begin(), rsize, std::ostream_iterator<int>(std::cout, ","));
        std::cout << std::endl;
    
    
    // in a cuda kernel
    
    
        unsigned int counts[size];
        unsigned int *d_whole;
        unsigned int *d_counts;
    
        cudaMalloc((void**) &d_whole, size * sizeof(unsigned int));
        cudaMalloc((void**) &d_counts, size * sizeof(unsigned int));
    
        cudaMemcpy(d_whole, wholeSequenceArray, size * sizeof(unsigned int),
                        cudaMemcpyHostToDevice);
    
        calculateOccurances<<<1, size>>>(d_whole, d_counts);
    
        cudaMemcpy(counts, d_counts, size * sizeof(unsigned int),
                        cudaMemcpyDeviceToHost);
    
        std::cout << "from Host:" << std::endl;
        cout << endl << "Counts" << endl << endl;
        for (int i = 0; i < 25; ++i) {
                cout << counts[i] << ",";
        }
        cout << endl;
    
        cudaFree(d_whole);
    }
    $ nvcc -arch=sm_61 -o t91 t91.cu
    $ ./t91
    wholeSequenceArray:
    1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,
    Start C++ Array New
    Size of Input:461
    Max threads per block:  1024
    rsize:20
    Thrust keys:
    1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,
    Thrust vals:
    24,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,
    from kernel:
    24,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,526324,526325,526325,526327,526329,from Host:
    
    Counts
    
    24,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,526324,526325,526325,526327,526329,
    $
    

    Notes:

    1. I've included a method in the thrust example (computing rsize from the iterator pair that reduce_by_key returns) so you can know exactly the size of the output arrays.

    2. The thrust method should work fine independent of the size parameter, subject to the limits of your GPU (e.g. memory size). The CUDA kernel method is really just running the thrust code from a single thread, so it's not sensible to run more than 1 block.

    3. You may wish to refer to this question/answer for more discussion of using thrust from CUDA kernels.