I've reduced my project down to just the relevant code. The part that is really bothering me is that this does not produce any errors. Anyway, I have a struct GpuData:
struct GpuData { float x, y, z; };
My goal is to launch a kernel that takes a function and applies it to the struct. So let's look at an example kernel:
__global__ void StructFunctor(GpuData* in_dat, nvstd::function<float(void)> func) {
    in_dat->x = func();
    in_dat->y += float{1};
}
In this case, the kernel is reduced to something very simple. It will set the x value to the result of the function. It will then add 1 to the y value.
So let's try it. A complete source file (cuda_demo.cu):
#include <iostream>
#include <nvfunctional>
struct GpuData { float x, y, z; };
__global__ void StructFunctor(GpuData* in_dat, nvstd::function<float(void)> func) {
    in_dat->x = func();
    in_dat->y += float{1};
}

int main(int argc, char** argv) {
    GpuData c_dat {2, 3, 5};
    std::cout << "Input x: " << c_dat.x << " y: " << c_dat.y << " z: " << c_dat.z << std::endl;

    GpuData* g_dat;
    cudaMalloc(&g_dat, sizeof(GpuData));
    cudaMemcpy(g_dat, &c_dat, sizeof(GpuData), cudaMemcpyHostToDevice);

    StructFunctor<<<1, 1>>>(g_dat, []()->float{return 1.0f;});

    cudaMemcpy(&c_dat, g_dat, sizeof(GpuData), cudaMemcpyDeviceToHost);
    std::cout << "Output x: " << c_dat.x << " y: " << c_dat.y << " z: " << c_dat.z << std::endl;
    return 0;
}
Well, if we are actually going to try it, we will need the CMake files. I've tacked those on at the end.
On my machine it compiles and runs without errors. Here is my output:
./CudaDemo
Input x: 2 y: 3 z: 5
Output x: 2 y: 3 z: 5
The y variable has not been modified at all! But if I go back and comment out in_dat->x = func(); then I get this output:
./CudaDemo
Input x: 2 y: 3 z: 5
Output x: 2 y: 4 z: 5
Now the y value has been modified! That's a good start, but why is it that when I try to use the function, the GPU memory becomes immutable? I presume this is some sort of error, but it compiles and runs without warnings or errors.
Now, as promised, the CMake files to build this.
cmake_minimum_required(VERSION 3.8)
project(Temp LANGUAGES CXX CUDA)
set(CMAKE_CUDA_STANDARD 14)
add_executable(CudaDemo cuda_demo.cu)
set_property(TARGET CudaDemo PROPERTY CUDA_SEPARABLE_COMPILATION ON)
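For reference, I build and run with a standard out-of-source CMake workflow, something like:

$ mkdir build && cd build
$ cmake ..
$ cmake --build .
$ ./CudaDemo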
The problem is that your code is creating a lambda in host code (so it is compiled for whatever host processor you specify) and then you are attempting to use that compiled lambda in device code. This won't work. If you run your code with cuda-memcheck, it indicates an error which may take one of several forms; I see a message of "Invalid PC", which means that your program attempted to execute an instruction from an invalid location:
$ cuda-memcheck ./t277
========= CUDA-MEMCHECK
Input x: 2 y: 3 z: 5
========= Invalid PC
========= at 0x00000048 in void StructFunctor<float>(GpuData<float>*, nvstd::function<float () (void)>)
========= by thread (0,0,0) in block (0,0,0)
========= Device Frame:void StructFunctor<float>(GpuData<float>*, nvstd::function<float () (void)>) (void StructFunctor<float>(GpuData<float>*, nvstd::function<float () (void)>) : 0x40)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x2486ed]
========= Host Frame:./t277 [0x190b2]
========= Host Frame:./t277 [0x192a7]
In CUDA, if you want to use a lambda in device code, you must decorate it properly, just like any other code you intend to execute on the device. An initial introduction of this concept was made here, although you can find many other examples.
There are probably many ways to fix the code, depending on your final intent, but an approach that hews closely to the aforementioned introduction/link might look like this:
$ cat t277.cu
#include <iostream>

template <typename T>
struct GpuData {
    T x;
    T y;
    T z;
};

template <typename T, typename F>
__global__ void StructFunctor(GpuData<T>* in_dat, F f) {
    in_dat->x = f();
    in_dat->y += T{1};
}

int main(int argc, char** argv) {
    GpuData<float> c_dat {2, 3, 5};
    std::cout << "Input x: " << c_dat.x << " y: " << c_dat.y << " z: " << c_dat.z << std::endl;

    GpuData<float>* g_dat;
    cudaMalloc(&g_dat, sizeof(GpuData<float>));
    cudaMemcpy(g_dat, &c_dat, sizeof(GpuData<float>), cudaMemcpyHostToDevice);

    StructFunctor<float><<<1, 1>>>(g_dat, [] __host__ __device__ ()->float{return 1.0f;});

    cudaMemcpy(&c_dat, g_dat, sizeof(GpuData<float>), cudaMemcpyDeviceToHost);
    std::cout << "Output x: " << c_dat.x << " y: " << c_dat.y << " z: " << c_dat.z << std::endl;
    return 0;
}
$ nvcc -std=c++11 t277.cu -o t277 --expt-extended-lambda
$ cuda-memcheck ./t277
========= CUDA-MEMCHECK
Input x: 2 y: 3 z: 5
Output x: 1 y: 4 z: 5
========= ERROR SUMMARY: 0 errors
$
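Note that the --expt-extended-lambda flag in the compile command above is what allows the __device__ decoration on the lambda. If you build with the CMake files from your question instead, that flag has to reach nvcc as well; one way to do that (a sketch against your existing CudaDemo target) would be:

target_compile_options(CudaDemo PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:--expt-extended-lambda>)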
(the __host__ decorator that I added to the lambda is not necessary in this particular case, but the __device__ decorator is.)
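If you specifically want to keep the nvstd::function type, be aware that (per the CUDA programming guide) nvstd::function objects can be created and used in device code, but an nvstd::function initialized in host code cannot be passed to device code, which is exactly what your original kernel signature attempts. A possible rearrangement, sketched here, is to construct the nvstd::function inside the kernel from an ordinary device-side lambda (no extended-lambda flag needed, since the lambda itself lives entirely in device code):

#include <nvfunctional>

struct GpuData { float x, y, z; };

__global__ void StructFunctor(GpuData* in_dat) {
    // The nvstd::function is created in device code, so this is legal.
    nvstd::function<float(void)> func = []() -> float { return 1.0f; };
    in_dat->x = func();
    in_dat->y += float{1};
}

This can be launched as StructFunctor<<<1, 1>>>(g_dat); with the same main() as before, but of course it moves the choice of function into the kernel itself, which may defeat your purpose; the template-parameter approach above keeps the call site flexible.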
Note that I'm working off the original code you posted, not the modified version edited into your question by @einpoklum.
Before asking others for help, if you are having trouble with CUDA code, I usually recommend doing proper CUDA error checking and running your code with cuda-memcheck. Even if you don't understand the output, it will be useful for those trying to help you.
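In case it's useful, "proper CUDA error checking" can be as little as a small macro wrapped around every runtime call, plus a check after each kernel launch. A minimal sketch (CUDA_CHECK is just a name I'm choosing here):

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Typical usage around the calls in your program:
//   CUDA_CHECK(cudaMalloc(&g_dat, sizeof(GpuData)));
//   StructFunctor<<<1, 1>>>(g_dat, f);
//   CUDA_CHECK(cudaGetLastError());        // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches errors during execution

With something like this in place, your original program would not have failed silently: the check after the kernel (or on the cudaMemcpy back to the host) would have returned an error instead of leaving c_dat unchanged with no indication of a problem.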