Copying data from a smaller vector to a larger vector


I am working on a GPU project using Thrust. Instead of trying to explain what I am doing, I am going to offer a simple, slightly generic scenario that is easier to explain and might help somebody else in the future.

Let's say I have a vector where I would like to modify every third element of the vector.

The two solutions I can think of to handle this are:

1) Using a thrust call like transform that modifies every third element, maybe with a predicate or something.

2) Copy every third element into a smaller vector, call transform on that, copy those elements back to their original spots of the original vector.

Are either of these possible using Thrust?

Is there another way or a better way to pull this off?

All advice is appreciated!


Solution

  • Are either of these possible using Thrust?

    Yes, both are possible.

    Is there another way or a better way to pull this off?

    The best approach depends to some extent on what else your application does with this data. But within the confines of what you've outlined, I think a Thrust strided range is likely to be a good choice, possibly the best one.

    Your first method is workable as-is, with an appropriately defined functor to condition the behavior. (For example, zip your data with a counting_iterator to provide a data "index", and have the functor condition the transform of the data on the corresponding "index".) However, it has the drawback that we would launch 3 times as many threads as necessary, since only 1 out of every 3 threads does any actual vector modification. The strided range method improves on this: every thread modifies a chosen vector element, and there are no "wasted" threads.
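
    The index-conditioned functor can be sketched on the host with plain standard C++. This is only an analog of the Thrust version: in Thrust the index would come from a counting_iterator zipped with the data, whereas here a simple loop supplies it. The names `negate_every_nth` and `transform_with_index` are hypothetical and chosen for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical functor: negates a value only when its index is a
// multiple of stride; all other values pass through unchanged.
struct negate_every_nth
{
    std::size_t stride;  // assumed to be > 0

    int operator()(int value, std::size_t index) const
    {
        return (index % stride == 0) ? -value : value;
    }
};

// Hypothetical helper: applies op to every (value, index) pair, in place.
// Every element is visited, even though only 1 in `stride` is changed,
// which is the "wasted work" drawback of this method.
template <typename Op>
void transform_with_index(std::vector<int>& data, Op op)
{
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] = op(data[i], i);
}
```

    Calling `transform_with_index(v, negate_every_nth{3})` on a vector negates the elements at indices 0, 3, 6, and so on, and leaves the rest alone.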

    Both in-place approaches (your functor/predicate method and the strided range) still carry some "inefficiency": due to GPU data loading characteristics, memory is fetched in contiguous chunks rather than element by element, so we end up accessing about 3 times as much data as we modify. Your second method (copying every 3rd element to a smaller vector) avoids this for later operations, but you pay the cost of a data copy, which would negate any benefit for a single transform operation. However, if you wanted to perform a number of additional steps on the reduced-size vector, the cost of the copy might be recovered across those remaining operations, since none of them pays the "inefficiency" of accessing 3 times as much data.
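
    The "3 times as much data" figure can be illustrated with a deliberately simplified host-side model. It assumes memory is fetched in aligned segments of a fixed size and that a segment is fetched in full as soon as one selected element lies in it; the 32-byte segment size used in the note below is an assumption for illustration, and the function name `strided_fetch_ratio` is hypothetical. Real hardware behavior depends on the GPU and on caching.

```cpp
#include <cstddef>
#include <set>

// Simplified model (an assumption, not a measurement): memory is fetched
// in aligned segments of segment_bytes, and a whole segment is fetched as
// soon as one selected element starts inside it.
// Returns bytes fetched divided by bytes actually used.
// Assumes num_elements > 0, stride > 0, and that element_bytes divides
// segment_bytes so no element straddles two segments.
double strided_fetch_ratio(std::size_t num_elements,
                           std::size_t element_bytes,
                           std::size_t stride,
                           std::size_t segment_bytes)
{
    std::set<std::size_t> touched_segments;
    std::size_t selected = 0;

    // Visit indices 0, stride, 2*stride, ... and record their segments.
    for (std::size_t i = 0; i < num_elements; i += stride)
    {
        touched_segments.insert((i * element_bytes) / segment_bytes);
        ++selected;
    }

    const double fetched = static_cast<double>(touched_segments.size() * segment_bytes);
    const double used    = static_cast<double>(selected * element_bytes);
    return fetched / used;
}
```

    Under this model, 4-byte ints with a stride of 3 and 32-byte segments touch every segment, so the ratio is 3. With a stride of 1 it is 1, which is the situation after copying to a smaller, dense vector.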

    Either way, the strided range method is useful: it can copy elements from the larger vector to the smaller vector, or it can launch a transform operation directly on the larger vector that modifies only the selected elements.
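
    The copy-out and copy-back steps of your second method can be sketched as a host-only round trip in standard C++. This is an analog of what the strided range does on the device, not Thrust code, and the helper names `gather_strided` and `scatter_strided` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: copies every stride-th element (indices 0, stride,
// 2*stride, ...) into a new, smaller vector. Assumes stride > 0.
std::vector<int> gather_strided(const std::vector<int>& data, std::size_t stride)
{
    std::vector<int> small;
    for (std::size_t i = 0; i < data.size(); i += stride)
        small.push_back(data[i]);
    return small;
}

// Hypothetical helper: writes the small vector back to the same strided
// positions of the larger vector, ignoring anything that would fall
// outside it. Assumes stride > 0.
void scatter_strided(const std::vector<int>& small,
                     std::vector<int>& data,
                     std::size_t stride)
{
    for (std::size_t k = 0; k < small.size() && k * stride < data.size(); ++k)
        data[k * stride] = small[k];
}
```

    Any number of operations can run on the small vector between the two calls, which is where the cost of the two copies would be recovered.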

    Here's a worked example, basically a trivial modification of the strided range example, that demonstrates 2 possible approaches - the first being a copy-then-transform, the second being a transform-in-place:

    $ cat t996.cu
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/iterator/transform_iterator.h>
    #include <thrust/iterator/permutation_iterator.h>
    #include <thrust/functional.h>
    #include <thrust/fill.h>
    #include <thrust/device_vector.h>
    #include <thrust/copy.h>
    #include <thrust/transform.h>
    #include <iostream>
    #include <iterator>   // std::ostream_iterator
    
    #define STRIDE 3
    
    // this example illustrates how to make strided access to a range of values
    // examples:
    //   strided_range([0, 1, 2, 3, 4, 5, 6], 1) -> [0, 1, 2, 3, 4, 5, 6]
    //   strided_range([0, 1, 2, 3, 4, 5, 6], 2) -> [0, 2, 4, 6]
    //   strided_range([0, 1, 2, 3, 4, 5, 6], 3) -> [0, 3, 6]
    //   ...
    
    template <typename Iterator>
    class strided_range
    {
        public:
    
        typedef typename thrust::iterator_difference<Iterator>::type difference_type;
    
        struct stride_functor : public thrust::unary_function<difference_type,difference_type>
        {
            difference_type stride;
    
            stride_functor(difference_type stride)
                : stride(stride) {}
    
            __host__ __device__
            difference_type operator()(const difference_type& i) const
            {
                return stride * i;
            }
        };
    
        typedef typename thrust::counting_iterator<difference_type>                   CountingIterator;
        typedef typename thrust::transform_iterator<stride_functor, CountingIterator> TransformIterator;
        typedef typename thrust::permutation_iterator<Iterator,TransformIterator>     PermutationIterator;
    
        // type of the strided_range iterator
        typedef PermutationIterator iterator;
    
        // construct strided_range for the range [first,last)
        strided_range(Iterator first, Iterator last, difference_type stride)
            : first(first), last(last), stride(stride) {}
    
        iterator begin(void) const
        {
            return PermutationIterator(first, TransformIterator(CountingIterator(0), stride_functor(stride)));
        }
    
        iterator end(void) const
        {
            return begin() + ((last - first) + (stride - 1)) / stride;
        }
    
        protected:
        Iterator first;
        Iterator last;
        difference_type stride;
    };
    
    int main(void)
    {
        thrust::device_vector<int> data(8);
        data[0] = 10;
        data[1] = 20;
        data[2] = 30;
        data[3] = 40;
        data[4] = 50;
        data[5] = 60;
        data[6] = 70;
        data[7] = 80;
    
        // print the initial data
        std::cout << "initial data: " << std::endl;
        thrust::copy(data.begin(), data.end(), std::ostream_iterator<int>(std::cout, " "));  std::cout << std::endl;
    
        typedef thrust::device_vector<int>::iterator Iterator;
    
        // create strided_range with indices [0,3,6]
        strided_range<Iterator> strided(data.begin(), data.end(), STRIDE);
        // method 1: copy data from larger vector to smaller, then transform it:
        // size the destination to the number of strided elements: ceil(size / STRIDE)
        thrust::device_vector<int> result1((data.size() + STRIDE - 1) / STRIDE);
        thrust::copy(strided.begin(), strided.end(), result1.begin());
        thrust::transform(result1.begin(), result1.end(), result1.begin(), thrust::negate<int>());
        std::cout << "method 1 result: " << std::endl;
        thrust::copy(result1.begin(), result1.end(), std::ostream_iterator<int>(std::cout, " "));
        std::cout << std::endl;
    
        // method 2: transform data "in-place":
        std::cout << "method 2 result: " << std::endl;
        thrust::transform(strided.begin(), strided.end(), strided.begin(), thrust::negate<int>());
        thrust::copy(data.begin(), data.end(), std::ostream_iterator<int>(std::cout, " "));  std::cout << std::endl;
    
        return 0;
    }
    $ nvcc -o t996 t996.cu
    $ ./t996
    initial data:
    10 20 30 40 50 60 70 80
    method 1 result:
    -10 -40 -70
    method 2 result:
    -10 20 30 -40 50 60 -70 80
    $
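
    The length of the strided range, as computed in `end()` above, is a rounded-up integer division. A minimal host-side sketch of that arithmetic follows; the function name `strided_count` is hypothetical.

```cpp
#include <cstddef>

// Number of elements visited when stepping through n elements with the
// given stride, starting at index 0: ceil(n / stride) in integer
// arithmetic. Mirrors the expression in strided_range::end().
// Assumes stride > 0.
std::size_t strided_count(std::size_t n, std::size_t stride)
{
    return (n + stride - 1) / stride;
}
```

    For 8 elements and a stride of 3 this gives 3 (indices 0, 3, 6). Note that `n / stride + 1` gives the same answer here but is one too large whenever n is an exact multiple of the stride, for example 9 elements with a stride of 3.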