I am trying to prove to myself that my CUDA development setup is working, but I am struggling to do so.
I have the following program which I simplified from: https://cuda-tutorial.readthedocs.io/en/latest/tutorials/tutorial02/solutions/vector_add_thread.cu
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <cuda.h>
#include <cuda_runtime.h>
#define N 10000000
#define MAX_ERR 1e-6
__global__ void write4(float *out, int n) {
    int index = threadIdx.x;
    int stride = blockDim.x;
    for(int i = index; i < n; i += stride){
        out[i] = 4.0;
    }
}
int main(){
    float *out, *d_out;
    // Allocate host memory
    out = (float*)malloc(sizeof(float) * N);
    // cuda err
    cudaError_t err;
    err = cudaMalloc((void**)&d_out, sizeof(float) * N);
    if(err != cudaSuccess){
        fprintf(stderr, "Error: %s\n", cudaGetErrorString(err));
    }
    // Execute kernel
    write4<<<1,256>>>(d_out, N);
    err = cudaDeviceSynchronize();
    if(err != cudaSuccess){
        fprintf(stderr, "Error: %s\n", cudaGetErrorString(err));
    }
    // Transfer data back to host memory
    err = cudaMemcpy(out, d_out, sizeof(float) * N, cudaMemcpyDeviceToHost);
    if(err != cudaSuccess){
        fprintf(stderr, "Error: %s\n", cudaGetErrorString(err));
    }
    // print first 10 elements
    for(int i = 0; i < 10; i++){
        printf("%f\n", out[i]);
    }
    // Verification
    for(int i = 0; i < N; i++){
        assert(fabs(out[i] - 4.0) < MAX_ERR);
    }
    printf("PASSED\n");
    // Deallocate device memory
    cudaFree(d_out);
    // Deallocate host memory
    free(out);
}
The output I get:
$ ./a.out
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
a.out: hello.cu:52: int main(): Assertion `fabs(out[i] - 4.0) < MAX_ERR' failed.
Aborted
I have an NVIDIA 4060 Ti and the OS is Debian 12. I am compiling the program with nvcc using the nvidia/cuda:12.6.2-devel-ubuntu24.04 Docker image. It compiles without errors. I run the binary outside of the container, and again it runs without (CUDA) errors until it gets to the verification step.
Any ideas? I'm pretty sure it's not an issue with the code, because more complex examples from canned tutorials silently fail in the same way (all 0s copied from device to host). I'm just baffled that I get zero compile-time or runtime errors, aside from my own assertions.
As @talonmies pointed out, I had incomplete error checking in the original snippet. Specifically, I wasn't checking whether the kernel launch itself succeeded, with something like:
write4<<<1,256>>>(d_out, N);
// cudaGetLastError() reports launch failures; err is the variable declared earlier in main
err = cudaGetLastError();
if (err != cudaSuccess) {
    fprintf(stderr, "CUDA error: kernel launch failed: %s.\n", cudaGetErrorString(err));
    return 1;
}
Doing this revealed the following runtime error:
the provided PTX was compiled with an unsupported toolchain
That quickly led me to the cause: I had pulled the wrong Docker tag from https://hub.docker.com/r/nvidia/cuda. The image I pulled had a newer CUDA version than my current driver supports.
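In hindsight, one way to spot this kind of mismatch early is to compare the CUDA version the installed driver supports with the version of the runtime the binary was built against. The following is a minimal sketch, not part of my original program. The CUDA_CHECK macro, the two version helpers, and the noop kernel are hypothetical names made up for illustration; cudaDriverGetVersion, cudaRuntimeGetVersion, and cudaGetLastError are real runtime API calls. It assumes the usual version encoding of 1000 * major + 10 * minor.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of the CUDA API):
// print the error with its source location and exit on any failure.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: CUDA error: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// CUDA reports versions as 1000 * major + 10 * minor (12060 means 12.6).
static int cuda_version_major(int v) { return v / 1000; }
static int cuda_version_minor(int v) { return (v % 1000) / 10; }

// Trivial kernel used only to exercise the launch path.
__global__ void noop() {}

int main() {
    int driver = 0, runtime = 0;
    // Latest CUDA version supported by the installed driver (0 if no driver is installed).
    CUDA_CHECK(cudaDriverGetVersion(&driver));
    // Version of the CUDA runtime this binary is using.
    CUDA_CHECK(cudaRuntimeGetVersion(&runtime));

    printf("driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
           cuda_version_major(driver), cuda_version_minor(driver),
           cuda_version_major(runtime), cuda_version_minor(runtime));

    if (runtime > driver) {
        fprintf(stderr, "warning: runtime is newer than the driver supports; "
                        "kernel launches may fail\n");
    }

    noop<<<1, 1>>>();
    CUDA_CHECK(cudaGetLastError());       // catches launch failures
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel runs
    printf("kernel launch OK\n");
    return 0;
}
```

With a setup like mine (toolkit in the image newer than the host driver), I would expect the warning to appear and the launch check to fail, though I have not tested this across other driver and toolkit combinations.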