Should cudaMemset work on a device pointer mapped from cudaHostRegister?

I came across some sample code from one of my colleagues in which cudaMemset doesn't seem to work properly when run on a V100.

#include <stdio.h>
#include <stdlib.h>

// Report (but do not abort on) any CUDA runtime error.
#define CUDACHECK(cmd) \
{\
    cudaError_t error = cmd;\
    if (error != cudaSuccess) {\
        fprintf(stderr, "info: '%s'(%d) at %s:%d\n", cudaGetErrorString(error), error, __FILE__, __LINE__);\
    }\
}

// Thread 0 adds `value` to the first element.
__global__ void setValue(int value, int* A_d) {
    int tx = threadIdx.x + blockIdx.x * blockDim.x;
    if (tx == 0) {
        A_d[tx] = A_d[tx] + value;
    }
}

// Thread 0 prints the first element as seen from the device.
__global__ void printValue(int* A_d) {
    int tx = threadIdx.x + blockIdx.x * blockDim.x;
    if (tx == 0) {
        printf("A_d: %d\n", A_d[tx]);
    }
}

int main(int argc, char* argv[]) {
    int *A_h, *A_d;
    int size = sizeof(int);
    A_h = (int*)malloc(size);
    A_h[0] = 1;
    CUDACHECK(cudaSetDevice(0));
    // Register the host allocation and get the device pointer mapped to it.
    CUDACHECK(cudaHostRegister(A_h, size, 0));
    CUDACHECK(cudaHostGetDevicePointer((void**)&A_d, A_h, 0));
    setValue<<<64, 1, 0, 0>>>(5, A_d);
    CUDACHECK(cudaDeviceSynchronize());
    printf("A_h: %d\n", A_h[0]);
    A_h[0] = 100;
    printf("A_h: %d\n", A_h[0]);
    printValue<<<64, 1, 0, 0>>>(A_d);
    CUDACHECK(cudaDeviceSynchronize());
    // The call in question: byte value 1, size 4 bytes.
    CUDACHECK(cudaMemset(A_d, 1, size));
    printf("A_h: %d\n", A_h[0]);
    printValue<<<64, 1, 0, 0>>>(A_d);
    CUDACHECK(cudaDeviceSynchronize());
    CUDACHECK(cudaHostUnregister(A_h));
    free(A_h);
    return 0;
}

Compiling and running this sample produces the following output:

/usr/local/cuda-11.0/bin/nvcc memsettest.cu -o test
./test
A_h: 6
A_h: 100
A_d: 100
A_h: 16843009
A_d: 16843009

We expect A_h and A_d to be set to 1 with cudaMemset, but they are set to a huge value instead, as shown above. Is cudaMemset expected to work on the device pointer A_d returned by cudaHostGetDevicePointer, or is A_d meant to be used only in kernels? We also see that cudaMemcpy (both DtoH and HtoD) seems to work on the same device pointer A_d. Can someone help us understand the correct behavior?

Solution

We expect A_h and A_d to be set to 1 with cudaMemset.

You're confused about how cudaMemset works. Conceptually, it is very similar to memset from the C standard library. Try the same test case with memset and see what it does.

Anyway, cudaMemset takes a pointer, a byte value, and a size in bytes to set, just like memset.

So your cudaMemset command:

    CUDACHECK (cudaMemset(A_d, 1, size) );

is setting each byte to 1. Since size is 4, that means you are setting A_d[0] to 0x01010101 (in hexadecimal). That is 16777216 + 65536 + 256 + 1 = 16843009 in decimal, which any programmer's calculator will confirm. So everything is working as expected here, from what I can see.
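The byte-wise fill can be reproduced on the host without any CUDA calls. The following is a minimal sketch, assuming a 4-byte integer; memset_int32 is a hypothetical helper name introduced only for illustration:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: fills every byte of a 4-byte integer with
// `byte_value` using std::memset and returns the resulting integer.
std::int32_t memset_int32(int byte_value) {
    std::int32_t x = 0;
    // memset writes `byte_value` (converted to unsigned char) into each of
    // the sizeof(x) == 4 bytes, just as cudaMemset(A_d, 1, size) does above.
    std::memset(&x, byte_value, sizeof(x));
    return x;
}
```

Under these assumptions, memset_int32(1) returns 0x01010101, the same 16843009 seen in the question's output, while memset_int32(0) returns 0, which is why zero-filling with memset-style calls gives the result people usually expect.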

Again, I'm pretty sure you would see the same behavior with memset for the same test case/usage.
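If the goal is for each int element to hold the value 1, a byte-wise fill cannot produce that, and an element-wise fill is needed instead. Here is a minimal host-side sketch, assuming the buffer is reachable through a host pointer (as A_h is in the question, where a host write of 100 was visible to the kernel); fill_ints is a hypothetical helper name, not a CUDA API:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper (not part of CUDA): assigns `value` to each of the
// `count` ints starting at `data`, element by element rather than byte by byte.
void fill_ints(int* data, std::size_t count, int value) {
    std::fill_n(data, count, value);
}
```

In the question's setup, calling fill_ints(A_h, 1, 1) in place of the cudaMemset call would be expected to leave A_h[0] equal to 1. For memory that is only accessible from the device, the same idea would need a small kernel that writes the value per element.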