Tags: c++, parallel-processing, sycl, intel-oneapi

SYCL/DPC++ CPU version gives correct result, but GPU version gives incorrect data


I compiled and ran the code below with the Intel DPC++ compiler. With the CPU selector I get the correct result, but with the GPU selector I get garbage values.

The code initialises an array named data with all 1s. In the SYCL kernel, each element is read through an accessor, multiplied by 3, and stored in a result array. Printing the result array should show all 3s, but I get junk values instead.

Because the GPU run produced junk values, I switched to the CPU selector, and there the code works without issues.

I tried this on both Linux and Windows, with compiler version dpcpp 2021.3.

#include <iostream>
#include <CL/sycl.hpp>
#include <array>
using namespace std;
using namespace sycl;
int main() {        
    
    sycl::gpu_selector selector;   
    //using cpu selector as in the line below works
    //sycl::cpu_selector selector;    
    sycl::queue q = sycl::queue(selector);
    std::cout << q.get_device().get_info<sycl::info::device::name>();    
    constexpr int size = 3;
    std::array<int, size> data{1,1,1};
    std::array<int, size> resultarray;
    range<1> num_items{ size };
    buffer<int, 1> data_buff(data.data(), num_items);
    buffer<int, 1> result(resultarray.data(), num_items);

    
    
    q.submit([&](sycl::handler& cgh) 
        {
        auto dataAccess = data_buff.get_access<access::mode::read_write>(cgh);
        auto resultAccess = result.get_access<access::mode::write>(cgh);
        cgh.parallel_for(num_items, [=](id<1>  i)
            {
                resultAccess[i] = dataAccess[i] * 3;
            });

        }).wait();


    std::cout <<"||"<< resultarray[0]<<"||"; //expected result ||3||
}

Can somebody explain why this code gives the wrong result on the GPU?


Solution

  • You are not triggering a copy back to the host. Presumably on the CPU, your SYCL implementation simply operates directly on the host pointer you passed in, so the problem stays hidden.

    Think about this: how could the SYCL implementation know that resultarray is read in your cout statement and that the results have to be copied back first? It cannot, because this memory access does not go through any SYCL construct. The wait() call only makes the host wait until the kernel has completed; it does not trigger any copies.

    The most important ways of triggering the necessary copy are:

    • Using buffer write-back: by default, a buffer constructed from a host pointer writes its contents back to that pointer in the buffer destructor (the buffer member functions set_write_back() and set_final_data() let you disable or redirect this). In your case, wrapping the buffer declarations and the kernel submission in an additional { } should suffice, since the buffers then go out of scope before your cout and the write-back is triggered.
    • Using a host_accessor on the result buffer instead of accessing resultarray directly. Creating the host_accessor makes the runtime wait for the kernel and make the buffer contents available on the host.
    • Using an explicit handler::copy() from an accessor on the result buffer to a host pointer.