
Why is this CUDA program silently failing? (cudaMemcpyDeviceToHost always results in zeros)


I am trying to prove to myself that my CUDA development setup is working, but I am struggling to do so.

I have the following program, which I simplified from https://cuda-tutorial.readthedocs.io/en/latest/tutorials/tutorial02/solutions/vector_add_thread.cu:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define N 10000000
#define MAX_ERR 1e-6

__global__ void write4(float *out, int n) {
    int index = threadIdx.x;
    int stride = blockDim.x;

    for(int i = index; i < n; i += stride){
        out[i] = 4.0;
    }
}

int main(){
    float *out, *d_out; 

    // Allocate host memory
    out = (float*)malloc(sizeof(float) * N);

    // cuda err
    cudaError_t err;

    err = cudaMalloc((void**)&d_out, sizeof(float) * N);
    if(err != cudaSuccess){
        fprintf(stderr, "Error: %s\n", cudaGetErrorString(err));
    }

    // Execute kernel 
    write4<<<1,256>>>(d_out, N);
    
    err = cudaDeviceSynchronize();
    if(err != cudaSuccess){
        fprintf(stderr, "Error: %s\n", cudaGetErrorString(err));
    }

    // Transfer data back to host memory
    err = cudaMemcpy(out, d_out, sizeof(float) * N, cudaMemcpyDeviceToHost);
    if(err != cudaSuccess){
        fprintf(stderr, "Error: %s\n", cudaGetErrorString(err));
    }


    // print first 10 elements
    for(int i = 0; i < 10; i++){
        printf("%f\n", out[i]);
    }

    // Verification
    for(int i = 0; i < N; i++){
        assert(fabs(out[i] - 4.0) < MAX_ERR);
    }

    printf("PASSED\n");

    // Deallocate device memory
    cudaFree(d_out);

    // Deallocate host memory
    free(out);
}

The output I get:

$ ./a.out 
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
a.out: hello.cu:52: int main(): Assertion `fabs(out[i] - 4.0) < MAX_ERR' failed.
Aborted

I have an NVIDIA RTX 4060 Ti and the OS is Debian 12. I am compiling the program with nvcc inside the nvidia/cuda:12.6.2-devel-ubuntu24.04 Docker image. It compiles without errors. I run the binary outside of the container, and again it runs without (CUDA) errors until it gets to the verification step.

Any ideas? I'm pretty sure it's not an issue with the code, because more complex examples from canned tutorials silently fail in the same way (all 0s copied from device to host). I'm just baffled that I get zero compile-time or runtime errors, aside from my own assertions.


Solution

  • As @talonmies pointed out, I had incomplete error checking in the original snippet. Specifically, I wasn't checking whether the kernel launch was successful with something like:

    write4<<<1,256>>>(d_out, N);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    

    Doing this revealed the following runtime error:

    the provided PTX was compiled with an unsupported toolchain
    

    That then helped me quickly figure out that I had pulled the wrong Docker tag from https://hub.docker.com/r/nvidia/cuda -- the one I pulled shipped a newer CUDA version than my installed driver supports.
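
    For reference, a minimal sketch along these lines (not part of the original program) can confirm that kind of mismatch by querying the runtime API: cudaRuntimeGetVersion reports the CUDA runtime the binary was built against, and cudaDriverGetVersion reports the highest CUDA version the installed driver supports. A runtime newer than the driver is exactly the situation behind the "unsupported toolchain" message.

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int runtimeVersion = 0, driverVersion = 0;

        // Version of the CUDA runtime this binary was built/linked against
        cudaRuntimeGetVersion(&runtimeVersion);
        // Highest CUDA version the installed driver supports
        cudaDriverGetVersion(&driverVersion);

        printf("runtime: %d, driver: %d\n", runtimeVersion, driverVersion);
        if (runtimeVersion > driverVersion) {
            printf("Runtime is newer than the driver supports; "
                   "kernel launches will fail.\n");
        }
        return 0;
    }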