
Synchronizing device memory access with host thread


Is it possible for a CUDA kernel to synchronize writes to device-mapped memory without any host-side invocation (e.g., of cudaDeviceSynchronize)? When I run the following program, it doesn't seem that the kernel waits for the writes to device-mapped memory to complete before terminating because examining the page-locked host memory immediately after the kernel launch does not show any modification of the memory (unless a delay is inserted or the call to cudaDeviceSynchronize is uncommented):

#include <stdio.h>
#include <cuda.h>

// Negate each element in place; __threadfence_system() is meant to make
// the writes visible to the host before the kernel exits.
__global__ void func(int *a, int N) {
    int idx = threadIdx.x;

    if (idx < N) {
        a[idx] *= -1;
        __threadfence_system();
    }
}

int main(void) {
    int *a, *a_gpu;
    const int N = 8;
    size_t size = N*sizeof(int);

    // Enable mapped (zero-copy) memory, allocate page-locked host memory,
    // and obtain the device-side pointer that aliases it.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **) &a, size, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **) &a_gpu, (void *) a, 0);

    for (int i = 0; i < N; i++) {
        a[i] = i;
    }
    for (int i = 0; i < N; i++) {
        printf("%i ", a[i]);
    }
    printf("\n");

    func<<<1, N>>>(a_gpu, N);
    // cudaDeviceSynchronize();

    // Examine the mapped memory immediately after the launch.
    for (int i = 0; i < N; i++) {
        printf("%i ", a[i]);
    }
    printf("\n");

    cudaFreeHost(a);
    return 0;
}

I'm compiling the above for sm_20 with CUDA 4.2.9 on Linux and running it on a Fermi GPU (S2050).


Solution

  • A kernel launch returns control to the host immediately, before any kernel activity has occurred. Kernel execution is in this sense asynchronous with respect to host execution and does not block it. So it's no surprise that you have to wait a bit, or else use a barrier (like cudaDeviceSynchronize()), to see the results of the kernel.
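
    For illustration, here is a trimmed sketch of your program with the barrier in place (the kernel name and the error-free happy path are my own simplifications; the mapped-memory setup is the same as yours):

    #include <stdio.h>
    #include <cuda.h>

    __global__ void negate(int *a, int N) {
        int idx = threadIdx.x;
        if (idx < N) a[idx] *= -1;
    }

    int main(void) {
        const int N = 8;
        int *a, *a_gpu;
        cudaSetDeviceFlags(cudaDeviceMapHost);
        cudaHostAlloc((void **) &a, N * sizeof(int), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **) &a_gpu, (void *) a, 0);
        for (int i = 0; i < N; i++) a[i] = i;

        negate<<<1, N>>>(a_gpu, N);
        // Without a barrier here, reading a[] could still show 0..7.
        cudaDeviceSynchronize();   // host blocks until the kernel has finished

        for (int i = 0; i < N; i++) printf("%i ", a[i]);  // 0 -1 -2 ... -7
        printf("\n");
        cudaFreeHost(a);
        return 0;
    }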

    As described in the CUDA C Programming Guide:

    In order to facilitate concurrent execution between host and device, some function calls are asynchronous: Control is returned to the host thread before the device has completed the requested task. These are:

    • Kernel launches;
    • Memory copies between two addresses to the same device memory;
    • Memory copies from host to device of a memory block of 64 KB or less;
    • Memory copies performed by functions that are suffixed with Async;
    • Memory set function calls.

    This is all intentional, of course, so that you can use the GPU and CPU simultaneously. If you don't want this behavior, a simple solution, as you've already discovered, is to insert a barrier. And if your kernel is producing data that you will immediately copy back to the host, you don't need a separate barrier: a cudaMemcpy call after the kernel will wait until the kernel has completed before it begins its copy operation.
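
    For example, here is a minimal sketch of that pattern using ordinary device memory rather than mapped memory (the kernel is the same toy negation as above):

    #include <stdio.h>
    #include <cuda.h>

    __global__ void negate(int *a, int N) {
        int idx = threadIdx.x;
        if (idx < N) a[idx] *= -1;
    }

    int main(void) {
        const int N = 8;
        int h[N], *d;
        for (int i = 0; i < N; i++) h[i] = i;

        cudaMalloc((void **) &d, N * sizeof(int));
        cudaMemcpy(d, h, N * sizeof(int), cudaMemcpyHostToDevice);

        negate<<<1, N>>>(d, N);

        // No explicit barrier needed: this copy is issued to the same
        // (default) stream as the kernel, so it waits for the kernel to
        // finish and blocks the host until the data has arrived.
        cudaMemcpy(h, d, N * sizeof(int), cudaMemcpyDeviceToHost);

        for (int i = 0; i < N; i++) printf("%i ", h[i]);  // 0 -1 -2 ... -7
        printf("\n");
        cudaFree(d);
        return 0;
    }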

    I guess that to answer your question, you want kernel launches to be synchronous without even having to use a barrier (why do you want to do this? Is adding the cudaDeviceSynchronize() call a problem?). It's possible to do this:

    "Programmers can globally disable asynchronous kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should never be used as a way to make production software run reliably. "

    If you want this synchronous behavior, it's better just to use a barrier (or to depend on a subsequent blocking CUDA call, like cudaMemcpy). If you use the method above and depend on it, your code will break as soon as somebody else runs it without the environment variable set, so it's really not a good idea.
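
    (For reference, the variable is set in the shell environment when the program is launched, e.g. CUDA_LAUNCH_BLOCKING=1 ./a.out, assuming the default nvcc output name. No code change is involved, which is exactly why behavior that depends on it disappears silently when the variable isn't set.)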